Basset
Basset is a full-text PHP Information Retrieval library. It is a collection of developments in the field of IR, ported to PHP for research purposes. Basset provides different ways of searching through documents in a collection (ad-hoc retrieval) by applying advanced and experimental IR algorithms and techniques gathered from various research studies and conferences.
Basics
Warning: This tool is continuously under development. Please treat it as a research tool rather than a drop-in solution for production needs.
Adding Documents
Basset manages adding documents through the IndexWriter class. It processes the documents you add and later commits them to an external file.
The constructor takes a directory path and an overwrite flag (defaulting to '../index/' and true, respectively).
Setting overwrite to false ensures you won't accidentally overwrite an existing index inside the directory.
Methods:
- addDocument(DocumentInterface $d, $metadata = null) - requires a TokensDocument instance, which is a plain wrapper for tokenized documents. It can also take a MetaData class, which simply stores any information about the document (a URL, file path, title, etc.); just make sure it is keyed properly.
- applyTransformation(TransformationInterface $transformer) - accepts any transformation to be applied to each document (these are covered further down).
- setFilename(string $filename) - sets a custom filename for the committed file. If not set, it creates a file named basset_index (.idx is always the extension, so there is no need to specify it).
- close() - closes the IndexWriter, which means no more documents can be added. All index processing and file writing is committed at this point, and none of the methods above may be called afterwards.
- getLocation() - returns the string location of the created file (a minimal end-to-end sketch follows this list).
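For illustration, here is a minimal write-and-commit sketch using the methods above; the directory, filename and $someTokensDocument are placeholders (building a TokensDocument from raw text is shown in the example below):
$indexWriter = new IndexWriter('../my_index/', false); // custom directory; false so an existing index won't be overwritten
$indexWriter->addDocument($someTokensDocument); // add one or more documents (see the example below)
$indexWriter->setFilename('my_collection'); // the committed file will be my_collection.idx
$indexWriter->close(); // commits all counting and file writing; no further additions allowed
echo $indexWriter->getLocation(); // e.g. '../my_index/my_collection.idx'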
Basset only tokenizes .txt files. As much as I'd like to add dependencies on 3rd-party PDF/Doc/Rtf/HTML parsers, it's up to you to choose which one to parse your documents with and feed its output to the tokenizer.
use Basset\Tokenizers\WhitespaceTokenizer;
use Basset\Documents\TokensDocument;
use Basset\Index\IndexWriter;
use Basset\MetaData\MetaData;
// Get files
$doc1 = file_get_contents('../location/for/file1.txt');
$doc2 = file_get_contents('../location/for/file2.txt');
$doc3 = file_get_contents('../location/for/file3.txt');
// Instantiate tokenizer
$tokenizer = new WhitespaceTokenizer();
/*
* Add docs to Index and transform.
*/
$indexWriter = new IndexWriter();
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), new MetaData(array('title' => 'Deadpool')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc2)), new MetaData(array('title' => 'BigFish')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc3)), new MetaData(array('title' => 'American Sniper')));
You can use either WhitespaceTokenizer, which splits on every space and newline, or WhitespaceAndPunctuationTokenizer, which also tokenizes punctuation.
Metadata is not required at all, but providing it lets you identify which document is which later on.
Note:
The documentID used will be their key offset, i.e. 0-N where N is the last document added.
When tagging metadata, if two metadata tags share the same key, the latter always overwrites the former.
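For example (building on the snippet above), a document can carry several tags, and a tag is read back from a search result later through getTag():
$meta = new MetaData(array('title' => 'Deadpool', 'path' => '../location/for/file1.txt'));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), $meta);
// later, on a search result: $result->getMetaData()->getTag('title') returns 'Deadpool'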
Similarly, a query is also a form of document in Basset (albeit a shorter one; in IR, a query is sometimes called a "pseudo-document"). You will see in the wrapping-up section later how a query should be set.
$query = new Document(new TokensDocument($tokenizer->tokenize('BIG fish deadpool')));
Transforming Documents
Before indexing, different transformations can be applied to pre-process each document: tokenization, canonicalizing tokens, filtering words that carry little value, removing derivational affixes, morphological analysis, etc., all with the purpose of simplifying the succeeding processes and, in the case of Basset, of alleviating the issue of not giving proper weights/ranks/scores to the documents that actually matter for a query.
The following is not within Basset's scope, but pretty much any NLP tools can help:
- Language-classifier based Tokenizers
- Language-classifier based Normalizers
- Filters (Stopword, Date, Email/Domain, Special characters, etc.)
- Thesauri
- Lemmatization
- Stemming
Normalizing documents ensures that tokens are 'canonical' and equivalent, which gives a better expectation of the result.
Take note that Basset's default (i.e. the English class) simply transforms all tokens to lowercase.
use Basset\Normalizers\English;
use Basset\Utils\TransformationSet;
use Basset\Utils\StopWords;
$stopwords = file_get_contents('../location/for/stopwords.txt');
// Specify Transformations
$transformations = array(
new English(),
new StopWords($tokenizer->tokenize($stopwords))
);
// Initiate TransformationSet
$transform = new TransformationSet();
$transform->register($transformations);
Once instantiated, it can be applied to the IndexWriter (so long as close() hasn't been called):
$indexWriter->applyTransformation($transform);
The applyTransformation method also accepts a single transformation instead of going through a TransformationSet.
$indexWriter->applyTransformation(new English());
StopWords can also take its own normalizer; English is the default:
new StopWords($tokenizer->tokenize($stopwords), new English());
Remember that the same processing done to the documents is expected to be done on the query as well, since a query is also a form of document (in Basset at least, if not in the IR field generally). For example, we cannot search for 'and' if 'and' was filtered from all documents before indexing.
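A minimal sketch, reusing the TransformationSet registered above:
$query = new Document(new TokensDocument($tokenizer->tokenize('BIG fish deadpool')));
$query->applyTransformation($transform); // the same stopword filtering and lowercasing applied to the indexed documents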
Indexing Documents
Once the documents are tokenized, added and transformed, the index has to be processed. Per Wikipedia, indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
IndexWriter processes the index once close() method is called. It commits and starts counting and writing the counted data to a file (.idx).
The index file aggregates basic statistics for the document collection.
It gathers two kinds of statistics, CollectionStatistics and EntryStatistics:
CollectionStatistics
These are the overall stats for the collection itself.
It has the following methods (replace get with set for the setters):
- getAverageDocumentLength() - Returns average doc length (∑TF/N).
- getNumberOfDocuments() - Returns number of documents in the collection (N).
- getNumberOfTokens() - Returns total number of all tokens in the entire collection (∑TF).
- getNumberOfUniqueTokens() - Returns the total number of unique terms in the collection.
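For example, three documents of 100, 200 and 300 tokens give ∑TF = 600, N = 3 and an average document length of 600/3 = 200.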
EntryStatistics
This class provides basic statistics about a lexical entry (a term); all documents added together are treated as the 'lexicon' or 'corpus'. It has the following methods (replace get with set for the setters):
- getTermFrequency() - Returns the number of occurrences of the term in the entire corpus (TFt).
- getDocumentFrequency() - Returns the number of documents containing the term in the entire corpus (DFt).
- getTotalByTermPresence() - Returns the total number of tokens across only the documents in which the term appears.
- getUniqueTotalByTermPresence() - Returns the total number of unique terms across only the documents in which the term appears.
- getPostingList() - Returns the array of posting lists for the term.
PostingStatistics(DocumentInterface)
This class indexes and represents one posting (a document) in a posting list. Take note that PostingStatistics does not concern itself with the entire corpus' stats (tf here is local, not to be confused with the corpus-wide termFrequency in EntryStatistics). It has the following method (replace get with set for the setter):
- getTf() - Returns the frequency of the term in the current document (tf).
Once the indexing is done, an entry in the index will look pretty much like this:
id | tf(corpus) | df | ttp | uttp | postingList |
---|---|---|---|---|---|
'word' | 2500 | 150 | 16000 | 4000 | array('document1' => 540, 'document2' => 60) .... |
In addition, the index also stores the metadata given for each added document.
IndexReader
Once the index file is created, IndexReader() takes care of reading it.
If you used a custom directory path (i.e. you set a different destination folder as IndexWriter's parameter, and optionally a custom file name through setFileName()), you need to pass that path to IndexReader(); otherwise it just looks for the default file (IndexWriter() without parameters always writes the index at /index/basset_index.idx).
Pointer: IndexWriter()->getLocation() returns the string location of the created index file. Once you have called close() on the IndexWriter and the index is ready for reading, and it happens to reside in a custom directory under a custom name, you can get its location with this method and pass it straight to the reader: new IndexReader($indexWriter->getLocation()).
The created index file acts as the persistent data for all documents added to the collection. Once created, you can always reuse the file to avoid re-indexing the same collection at run-time (handy if you're simply trying out different models on the same collection, e.g., NPL, Medline, TREC, etc.).
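A minimal sketch of both cases:
$indexReader = new IndexReader(); // reads the default /index/basset_index.idx
$indexReader = new IndexReader($indexWriter->getLocation()); // or point it at a custom index file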
What comes next is a wrapping-up section for all of the above.
Search
use Basset\Index\IndexReader;
use Basset\Index\IndexWriter;
use Basset\Documents\TokensDocument;
use Basset\Documents\Document;
use Basset\Normalizers\English;
use Basset\Tokenizers\WhitespaceTokenizer;
use Basset\Utils\StopWords;
use Basset\Utils\TransformationSet;
use Basset\MetaData\MetaData;
// Different Expansion models
use Basset\Expansion\IdeDecHi;
use Basset\Expansion\IdeRegular;
use Basset\Expansion\Rocchio;
use Basset\Expansion\RelevanceModel;
use Basset\Expansion\GeneticAlgorithm;
use Basset\Expansion\DifferentialEvolution;
use Basset\Expansion\CauchyDE;
use Basset\Expansion\SelfAdaptiveDE;
// Make Everything happen here
use Basset\Search\Search;
// Make Language Models happy
use Basset\Models\DirichletLM;
// Make Probabilistic Models happy
use Basset\Models\BM25;
// Make VS Models happy
use Basset\Models\TfIdf;
// Make DFR Models happy
use Basset\Models\DFRModel;
use Basset\Models\Normalization\NormalizationH1;
use Basset\Models\DFRModels\In;
use Basset\Models\DFRAfterEffect\B;
// Make DFI Models happy
use Basset\Models\DFIModel;
use Basset\Models\DFIModels\ChiSquared;
use Basset\Models\Idf;
// Make IB Models happy
use Basset\Models\IBModel;
use Basset\Models\Normalization\NormalizationH2;
use Basset\Models\IBDistribution\LLDistribution;
use Basset\Models\IBLambda\LambdaDF;
/*
*
* Load the documents and pre-process them. (I use screenplays from imsdb.com as document samples)
*
*/
$doc1 = file_get_contents("test.txt");
$doc2 = file_get_contents('test2.txt');
$doc3 = file_get_contents('test3.txt');
$doc4 = file_get_contents('test4.txt');
$doc5 = file_get_contents('test5.txt');
$stopwords = file_get_contents('stopwords.txt');
$tokenizer = new WhitespaceTokenizer();
$english = new English();
$filter = new StopWords($tokenizer->tokenize($stopwords));
$stemmer = new RegexStemmer('/(\w+)(?=ly|es|s|y)/'); // any stemmer works as long as it implements TransformationInterface (import not shown here)
$transformations = array(
$english,
$filter,
$stemmer
);
$transform = new TransformationSet();
$transform->register($transformations);
/*
* Set the Collection
*/
$indexWriter = new IndexWriter();
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), new MetaData(array('title' => 'Deadpool')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc2)), new MetaData(array('title' => 'BigFish')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc3)), new MetaData(array('title' => 'American Sniper')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc4)), new MetaData(array('title' => 'La La Land')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc5)), new MetaData(array('title' => 'The Revenant')));
$indexWriter->applyTransformation($transform);
$indexWriter->close();
/*
* Start Reading the Index file
* Since I'm using default path and name, just start IndexReader in default.
*/
$indexReader = new IndexReader();
/*
* Initiate Query
* All registered pre-processing transformations should fix and normalize this messily formatted query.
*/
$query = new Document(new TokensDocument($tokenizer->tokenize('bIg bEaR and fIshy, tHen deadpool')));
$query->applyTransformation($transform);
/*
* START SEARCH
* This is the point where you decide the models and metric to use for scoring.
*/
/*
* Set LM Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new DirichletLM);
/*
* Set Probabilistic Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new BM25);
/*
* Set VSM Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new TfIdf);
/*
* Set DFR Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new DFRModel(new In, new B, new NormalizationH1));
/*
* Set DFI Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new DFIModel(new ChiSquared, new Idf));
/*
* Set IB Ranking models
*/
$search = new Search($indexReader);
$search->query($query);
$search->model(new IBModel(new LLDistribution, new LambdaDF, new NormalizationH1));
/*
* You can enable QueryExpansion if you wish for a much better recall/precision.
* All expansion models default to the top 20 relevant and top 10 non-relevant docs, with querylength + 30 top terms used for expansion.
* They also take their own parameters, so please read each class for how and where they're derived.
* The Evolutionary Algorithms (EA) are an experiment of mine that only takes in the fittest document derived from 100 generations.
*/
// For VSM and probabilistic models
$search->setQueryExpansion(new IdeDecHi);
$search->setQueryExpansion(new Rocchio);
$search->setQueryExpansion(new IdeRegular);
$search->setQueryExpansion(new GeneticAlgorithm);
$search->setQueryExpansion(new DifferentialEvolution);
$search->setQueryExpansion(new CauchyDE);
$search->setQueryExpansion(new SelfAdaptiveDE);
// OR for language models
$search->setQueryExpansion(new RelevanceModel);
// EVERY search ends with search()
$results = $search->search();
/**
* search() returns an instance of ResultSet, and can take a $limit number and a boolean $descending as
* parameters to control the output. The more documents you add, the more results you'll be overwhelmed by (the default is search(10, true)).
* ResultSet exposes the results as an array through getResults(); each entry has the docID, the score and the MetaData given for the document.
* Take note: if you've set model() to TfIdf but similarity() to a distance metric (see the Metric namespace for metrics),
* and you want ascending order and the top 15 items, you'd do...
*/
$results = $search->search(15, false);
/*
* As explained above, it returns a ResultSet object, which collects the docId, score and MetaData for each doc.
* So to display it, you'd have to do..
*/
$display = array();
foreach($results->getResults() as $result) {
$title = $result->getMetaData()->getTag('title'); //getting the title tag from metadata added for the doc earlier.
$display[$title] = $result->getScore();
}
print_r($display);
Weighting Models
Each weighting model may or may not accept parameters, so please read each model to see which papers it is derived from, which parameters are required (or simply locked and set, like the log base of DFR's Idf, which always uses the binary log) and which values are set by default. Basset's Search class is the base wrapper for literally everything (the models were initially separated during Basset's pre-release stage, but were eventually centralized): Probabilistic Relevance, Language Models, DFR, DFI, IB and VS Models.
Probabilistic Models
Sample Usage:
$search = new Search(IndexReader);
$search->query(DocumentInterface class);
$search->model(WeightedModelInterface class);
$search->queryModel(WeightedModelInterface class); // not required, as a query model is instantiated by whichever model you chose above; this is just a way of changing the default.
$search->similarity(SimilarityInterface or DistanceInterface type); // not required, as a metric measure is instantiated by whichever model you chose above; this is just a way of changing the default.
$search->search(int $limit = 10, boolean $descending = true);
Probabilistic Relevance Models (Please read the following defaults for query models and metric used)
- BM25 - Okapi's Best Matching algorithm. TermCount is default for query model and VectorSimilarity is default for metric.
- ModBM25 - Ensures negative IDF don't violate Term-Frequency, Length Normalization and Tf-length Constraints. TermCount is default for query model and VectorSimilarity is default for metric.
- BM25L - Lv and Zhai's reworking of BM25, following Singhal et al.'s observation that it penalizes longer documents. TermCount is default for query model and VectorSimilarity is default for metric.
- BM25Plus - Based on Lv and Zhai's work for lower bounding BM25. TermCount is default for query model and VectorSimilarity is default for metric.
- AtireBM25 - Uses the Robertson-Walker IDF instead of the original Robertson-Sparck IDF. TermCount is default for query model and VectorSimilarity is default for metric.
- BSDS - Implements the Binary Standard Document Score (BSDS) with document length normalization. TermCount is default for query model and VectorSimilarity is default for metric.
- XSqrA_M - Inner product of Pearson's X^2 with the information growth computed with the multinomial M. TermCount is default for query model and VectorSimilarity is default for metric.
- HiemstraLM - Based on Hiemstra's work. TermCount is default for query model and VectorSimilarity is default for metric.
- DirichletLM - Bayesian smoothing with Dirichlet Prior. TermFrequency is default for query model and VectorSimilarity is default for metric.
- JelinekMercerLM - Based on the Jelinek-Mercer smoothing method. TermFrequency is default for query model and VectorSimilarity is default for metric.
- AbsoluteDiscountingLM - Absolute Discounting smoothing method. TermFrequency is default for query model and VectorSimilarity is default for metric.
- TwoStageLM - Leave-one-out method. This is also a generalization of both DirichletLM and JelinekMercerLM methods. TermFrequency is default for query model and VectorSimilarity is default for metric.
SPUD Language Models - Based on Cummins et al's work. (Please read the following recommendation for query models)
- DirichletSPUD - Estimation of the background DCM via Multivariate Polya Distribution with Bayesian smoothing with Dirichlet Prior. TermCount is default for query model and VectorSimilarity is default for metric. (This is not used at the moment, as the way it extracts the background model prolongs the indexing stage; using it throws an exception for now.)
- JelinekMercerSPUD - Estimation of the background DCM via Multivariate Polya Distribution with the Jelinek-Mercer smoothing method. TermCount is default for query model and VectorSimilarity is default for metric.
Divergence-From-Independence (DFI). (Please read the following recommendation for query models)
- IRRA12 - Term weighting model developed on the basis of Shannon’s Information Theory. TermCount is default for query model and VectorSimilarity is default for metric.
Divergence-From-Independence Framework
The DFI framework's document model wrapper is DFIModel. TermCount is default for query model and VectorSimilarity is default for metric. DFI models are obtained by instantiating the two components of the framework (a concrete instantiation is sketched after the lists below):
- Selecting a Divergence measure.
- Selecting an IDF method.
Usage:
DFIModel(DFIInterface $model, IDFInterface $idf)
Location:
Basset\Models\DFIModel;
Divergence Measures:
use Basset\Models\DFIModels\xxxxx;
- ChiSquared
- Saturated
- Standardized
use Basset\Models\xxxxx;
- Idf - Using log2(NumberofDocuments/DocumentFrequency) method.
- IdfOkapi - Using BM25's Idf. The paper used this for better accuracy.
- IdfDFR - Using DFR Framework's Idf. (this uses binary log)
- IdfSparckRobertson - Robertson-Sparck's IDF.
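For instance, a DFI model is built by combining a divergence measure and an IDF method from the lists above (IdfOkapi is the one noted above as the paper's choice for better accuracy):
use Basset\Models\DFIModel;
use Basset\Models\DFIModels\Saturated;
use Basset\Models\IdfOkapi;
$search->model(new DFIModel(new Saturated, new IdfOkapi));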
Information-Based Framework
The Information-Based framework's document model wrapper is IBModel. TermCount is default for query model and VectorSimilarity is default for metric. IB models are obtained by instantiating the three components of the framework (a concrete instantiation is sketched after the lists below):
- Selecting a Probabilistic Distribution.
- Selecting the method for getting the lambda parameter
- Normalising the term frequencies (Normalization based on DFR framework).
Usage:
IBModel(IBDistributionInterface $model, IBLambdaInterface $lambda, NormalizationInterface $normalization)
Location:
Basset\Models\IBModel;
Probabilistic Distributions:
use Basset\Models\IBDistribution\xxxxx;
- LLDistribution - The log-logistic Distribution.
- SPLDistribution - A Smoothed Power-Law (SPL) Distribution.
use Basset\Models\IBLambda\xxxxx;
- LambdaDF - Uses the average number of documents in which the term occurs.
- LambdaTTF - Uses the average number of term occurrences in the collection.
use Basset\Models\Normalization\xxxxx;
- (See DFR Normalization options)
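For instance, an IB model is built by pairing a distribution, a lambda method and a DFR normalization from the lists above:
use Basset\Models\IBModel;
use Basset\Models\IBDistribution\SPLDistribution;
use Basset\Models\IBLambda\LambdaTTF;
use Basset\Models\Normalization\NormalizationH2;
$search->model(new IBModel(new SPLDistribution, new LambdaTTF, new NormalizationH2));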
Divergence-From-Randomness Framework
The DFR framework's document model wrapper is DFRModel. TermCount is default for query model and VectorSimilarity is default for metric. DFR models are obtained by instantiating the three components of the framework (a concrete instantiation is sketched after the lists below):
- Selecting a basic randomness model.
- Applying the first normalisation.
- Normalising the term frequencies.
Usage:
DFRModel(BasicModelInterface $model, AfterEffectInterface $aftereffect, NormalizationInterface $normalization)
Location:
Basset\Models\DFRModel;
Basic Models:
use Basset\Models\DFRModels\xxxxx;
- P - Approximation of the binomial.
- BE - Bose-Einstein distribution.
- G - Geometric approximation of the Bose-Einstein.
- In - Inverse Document Frequency model.
- InFreq - Inverse Term Frequency model.
- InExp - Inverse Expected Document Frequency model.
After Effects (first normalisation):
use Basset\Models\DFRAfterEffect\xxxxx;
- L - Laplace’s law of succession.
- B - Ratio of two Bernoulli processes.
Normalizations:
use Basset\Models\Normalization\xxxxx;
- NormalizationH1 - Uniform distribution of the term frequency.
- NormalizationH2 - The term frequency density is inversely related to the length.
- NormalizationH2E - We use natural log instead of log2.
- NormalizationBM25 - BM25's normalization.
- NormalizationDP - Bayesian smoothing with Dirichlet Priors normalization.
- NormalizationJMDF - Jelinek-Mercer smoothing normalization using docfrequency.
- NormalizationJMTF - Jelinek-Mercer smoothing normalization using termfrequency.
- NormalizationP - Term Frequency Normalisation via Pareto Distributions.
- NormalizationF - Using increasing density function for the frequency normalisation.
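For instance, a DFR model is built by combining a basic model, an after effect and a normalization from the lists above (the DFRAfterEffect namespace for L is assumed here, mirroring the import of B shown in the Search example earlier):
use Basset\Models\DFRModel;
use Basset\Models\DFRModels\InExp;
use Basset\Models\DFRAfterEffect\L;
use Basset\Models\Normalization\NormalizationH2;
$search->model(new DFRModel(new InExp, new L, new NormalizationH2));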
Algebraic Calculations
Sample Usage:
$search = new Search(IndexReader);
$search->query(DocumentInterface class);
$search->model(WeightedModelInterface class);
$search->queryModel(WeightedModelInterface class); // not required and never used, as the query model will always be the same as the model you chose above.
$search->similarity(SimilarityInterface or DistanceInterface type); // not required, as a metric measure is instantiated by whichever model you chose above; this is just a way of changing the default (it has to be a VSMInterface type or it will throw an error).
$search->search(int $limit = 10, boolean $descending = true);
Algebraic Models in Basset represent the document and the weighting methods (like tf-idf models, for BOTH the query and the documents) in vector form. Unlike the previous models (which only score each term if it is also present in the query, i.e. a vector product), these focus on vector operations such as the deviation of angles between the query and document representations, set operations, and so on. Whatever weighting model is used to represent the terms in the documents is also used on the query.
Take note also that all calculations are smoothed to allow 1 vs. 1 document comparison (idf simply gives log(1/1) == 0 for a single-document corpus, and we don't want that if we just want to compare a single document).
Presently, you can use any form of weighting model here, but in a classical fashion, any simple TfIdf type is all you need:
- TfIdf
- PivotedTfIdf
- LemurTfIdf
- PivotedConcaveTFIDF
Similarity/Distance metrics:
Models like those mentioned above have CosineSimilarity as the default similarity metric; this can be changed through the similarity() method. The rule is that whenever the metric used is of VSMInterface type (i.e. anything but VectorSimilarity, the only similarity that is not a VSMInterface), both the query and document models must be the same (there is no point measuring similarity between documents that are not represented or weighted the same way; the weight represents the term in relation to the entire collection, and it serves as the term's identifier while being measured). A sketch combining a model and a metric follows the metric list below.
Pointer: similarity() replaces the default similarity set in the model.
You can create your own, as there is a whole family of notations for these (see the SMART system).
Similarity/Distance Types (SimilarityInterface and DistanceInterface both extend MetricInterface): once vectorized, the similarity or distance between the query and document vectors can be computed with any of the options listed below.
Each class includes the references it is derived from.
You can implement your own as long as it implements the proper interface (SimilarityInterface or DistanceInterface AND VSMInterface, as these should not be used with probabilistic retrieval models).
These measures have been classified into families for specific purposes. You can evaluate each one with an evaluation method of your own, as Basset currently has none.
Be aware that a distance should always be the reverse of a similarity (a similarity converges to 1 as the document matches the query and to 0 otherwise), yet some measures that appear to be distance metrics actually converge to 1, or even greater, as some are not even bounded between 0 and 1.
Divergences like that should be taken into consideration when using these methods, as they may impact business logic if the two are interchanged while coding. (It is important to read the references surrounding these measures before actually using them.)
- CosineSimilarity - Vectors computed thru Cosine similarity.
- SqrtCosineSimilarity - Square-root Cosine implementation using HellingerDistance norm.
- TriangleSectorSimilarity - A Hybrid Geometric Approach for cosine similarity.
- DiceSimilarity - Be wary: set operations work best on a larger collection (i.e. a wide spectrum of TfIdf values representing each term), as a smaller collection may produce duplicate values, especially for frequently occurring terms such as stop words.
- JaccardIndex
- TverskyIndex - A Generalization of JaccardIndex and DiceSimilarity. TverskyIndex accepts parameters, please take a look at the class.
- KLDivergence - Kullback-Leibler Divergence.
- JSDivergence - JensenShannon Divergence.
- RenyiDivergence - A generalization of both KL and JS divergence.
- KulczynskiDistance
- OverlapCoefficient
- BrayCurtisDistance
- CanberraDistance
- ChebyshevDistance
- ChiSquareDistance
- HellingerDistance
- LorentzianDistance
- MatusitaDistance
- MotykaSimilarity
- RuzickaSimilarity
- SoergleDistance
- StamatatosDistance
- EuclideanDistance
- CzekanowskiSimilarity
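Putting the two together, a sketch that weights both the query and the documents with PivotedTfIdf and swaps the default CosineSimilarity for SqrtCosineSimilarity:
use Basset\Models\PivotedTfIdf;
use Basset\Metric\SqrtCosineSimilarity;
$search = new Search($indexReader);
$search->query($query);
$search->model(new PivotedTfIdf);
$search->similarity(new SqrtCosineSimilarity); // optional; overrides the model's default CosineSimilarity
$results = $search->search();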
Using the Cranfield Collection and the sample.php file
Cranfield Collection has been the pioneer collection in information retrieval to validate a system's effectiveness.
I've included the 1400-abstract Cranfield Collection as an XML file that you can parse into separate files. (There is one in the Cranfield folder, and it's used in the sample.php file.)
The test file at tests/sample.php can be executed right away to parse the collection and run a search for a single test query (the default model is ModBM25).
Customize it to your needs if needed.
You can read Cranfield/cranfield-collection/cranqrel for Glasgow's qrels result.
I've also included SMART system's stopword list for standardization (see stopwords/stopwords.txt).
P.S.
This is a library. It can only help your own solutions, or with your research. You can use each classes separately and expand on it, do experiments and stuff, and if you find this helpful and wish to help, feel free to submit PRs. :).
This is what's currently in the sample file, sample.php:
include_once(__DIR__.'/../autoload.php');
include_once(__DIR__.'/../Cranfield/cranfield_parser.php');
use Basset\Documents\Document;
use Basset\Documents\TokensDocument;
use Basset\MetaData\MetaData;
use Basset\Search\Search;
use Basset\Models\TfIdf;
use Basset\Models\ModBM25;
use Basset\Metric\SqrtCosineSimilarity;
use Basset\Normalizers\English;
use Basset\Tokenizers\WhitespaceAndPunctuationTokenizer;
use Basset\Utils\StopWords;
use Basset\Utils\TransformationSet;
use Basset\Index\IndexReader;
use Basset\Index\IndexWriter;
use Basset\Models\DFIModel;
use Basset\Models\DFIModels\ChiSquared;
use Basset\Models\Idf;
use Basset\Expansion\IdeDecHi;
class Similarity {
public function test() {
$start = microtime(true);
// Initialize required stuff.
// THE DOCUMENTS
$cranfield = new CranfieldParser(__DIR__.'/../Cranfield/cranfield-collection/cran.all.1400.xml-format.xml');
$documents = $cranfield->parse();
// This is a set of NLP transformations used to analyze each token (term) in a given Document.
$stopwords = file_get_contents(__DIR__.'/../stopwords/stopwords.txt');
$tokenizer = new WhitespaceAndPunctuationTokenizer;
$pipeline = array(
new StopWords($tokenizer->tokenize($stopwords)),
new English,
// also add a stemmer here if you have one, as I don't include any; make sure it implements TransformationInterface
);
$transform = new TransformationSet;
$transform->register($pipeline);
/**
*
* Start indexing files.
* IndexWriter() takes an optional directory path, and it will create an index folder with an .idx file inside.
*
* Everything is committed to disk once close() is called; until then you can keep adding documents.
* The file created is Basset's inverted index file.
*
* Once created, IndexReader() takes care of reading it.
*
* If a custom directory path is created (and an optional custom file name thru setFileName()), you need
* to specify the path in IndexReader(), otherwise it'll just look for a default file (index/basset_index.idx).
*
* The created index file acts as the persistent data for all documents added in the collection. Once created,
* you can comment out the code below to avoid re-indexing the same collection during run-time (if you're
* simply trying out different models for the same collection, e.g., NPL, Medline, TREC, etc.).
*
*/
$index = new IndexWriter(__DIR__.'/../custom_index');
$index->setFileName('mycustomindex');
foreach($documents as $title => $body){
$index->addDocument(new TokensDocument($tokenizer->tokenize($body)), new MetaData(array('title' => $title)));
}
$index->applyTransformation($transform);
$index->close();
// MetaData class is a wrapper for assigning any array of info for a given doc, be it a title, path or a url, etc.
/**
* Dumping $index->getLocation() gives '../custom_index/mycustomindex.idx' which should be fed as parameter
* for IndexReader.
*/
// prepare one query as Document instance from Cranfield/cranfield-collection/cran.qry.xml-format
$query = new Document(new TokensDocument($tokenizer->tokenize('what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft.')));
$query->applyTransformation($transform);
/**
*
* Start search.
*
* There has been changes in class name and operations since the v1 release (to accomodate for
* structural changes).
*
* DocumentRanking became Search(mostly working as a manager for everything) and requires an IndexReader
* instance.
*
* Weighting Models are set thru model(), where the weighting model used for the query and the metric for
* comparing the query against the documents are explicitly set.
* You can still change them thru queryModel() and similarity(), and the info regarding the defaults are
* given from the docs at
* https://myth-of-sissyphus.blogspot.com/2018/02/basset-information-retrieval-library-in.html
*
*/
$indexReader = new IndexReader(__DIR__.'/../custom_index/mycustomindex.idx'); // read the custom index specified above
$search = new Search($indexReader);
$search->query($query);
$search->model(new ModBM25);
$search->setQueryExpansion(new IdeDecHi); //all expansion models default to top 10 relevant and non-relevant docs and querylength + 100 top terms to be used for expansion.
$results = $search->search(15); // default: 10
$display = array();
foreach($results->getResults() as $key => $result) {
$title = $result->getMetaData()->getTag('title'); //getting the title tag from metadata added for the doc.
$display[$title] = $result->getScore();
}
print_r($display); // top K docs
print_r(microtime(true) - $start . "\xA");
/**
* search() returns an instance of ResultSet in descending order, and can take a $limit number and a boolean $descending as
* parameters to control the output, as 1400 items is a lot to display (the default is search(10, true)).
* ResultSet displays result as array thru getResults(), it has docID, score and the given MetaData for the document.
*/
}
}
// parse Cranfield xml first before getting relevance
$sim = new Similarity;
$sim->test();
TO-DO:
Complete coverage of tests.
Proximity ranking (MarkovRandomFields) - might review indexing to allow blocks.
Allow index updates and deletion.