
Basset - Information Retrieval Library in PHP

Basset

Basset is a full-text PHP Information Retrieval library: a collection of developments in the field of IR, ported over to PHP for research purposes.

Basset provides different ways of searching through documents in a collection (ad-hoc retrieval) by applying advanced and experimental IR algorithms and techniques gathered from research studies and conferences, most notably:
  1. TREC
  2. SIGIR
  3. ECIR
  4. ACM

Basics

Warning: This is a tool that is continuously under development. Please treat it as a research tool rather than something for your production needs.

Adding Documents

Basset manages adding documents through the IndexWriter class.
It processes the documents you add and later commits them to an external file.
The constructor takes a directory path and an overwrite flag (these default to '../index/' and true, respectively).
Setting overwrite to false means you won't accidentally overwrite any existing index inside the directory.

Methods:

addDocument(DocumentInterface $d, $metadata = null) - requires a TokensDocument instance, which is a plain wrapper for tokenized documents. It can also take a MetaData class, which simply stores any information regarding the document (be it a URL, a file path or a title); just make sure it's keyed properly.
applyTransformation(TransformationInterface $transformer) - takes any transformation that will be applied to each document. (You can read about transformations further down.)
setFilename(string $filename) - custom filename for the committed file. If not set, it creates a file named basset_index (.idx is always the extension, no need to specify it).
close() - closes the IndexWriter, meaning no more documents can be added. All index processing and file writing is committed at this point, and none of the methods above can be called afterwards.
getLocation() - returns the string location of the created file.
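
For instance, a minimal writer lifecycle might look like the sketch below. It assumes the constructor order described above (directory path first, then the overwrite flag); the paths and filenames are only placeholders.

use Basset\Index\IndexWriter;

// Write to a custom directory; overwrite = false protects any index already in it.
$indexWriter = new IndexWriter('../my_index/', false);
$indexWriter->setFilename('my_collection');   // will be written as my_collection.idx
// ... addDocument() and applyTransformation() calls go here ...
$indexWriter->close();                        // commits everything to disk
echo $indexWriter->getLocation();             // path of the created .idx file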

Basset only tokenizes .txt files. As much as I'd like to add dependencies here for 3rd-party PDF/Doc/Rtf/HTML parsers, it's up to you to choose which ones you'd like to parse your documents with and feed the result to the tokenizer.

use Basset\Tokenizers\WhitespaceTokenizer;
use Basset\Index\IndexWriter;
use Basset\MetaData\MetaData;
use Basset\Documents\TokensDocument;


// Get files
$doc1 = file_get_contents('../location/for/file1.txt');
$doc2 = file_get_contents('../location/for/file2.txt');
$doc3 = file_get_contents('../location/for/file3.txt');

// Instantiate tokenizer
$tokenizer = new WhitespaceTokenizer();

/*
 * Add docs to Index and transform.
 */
$indexWriter = new IndexWriter();
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), new MetaData(array('title' => 'Deadpool')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc2)), new MetaData(array('title' => 'BigFish')));
$indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc3)), new MetaData(array('title' => 'American Sniper')));


You can use either WhitespaceTokenizer, which splits on every space and newline, or WhitespaceAndPunctuationTokenizer, which also splits out punctuation.
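
A rough sketch of the difference (the token boundaries in the comments are illustrative, not guaranteed output):

use Basset\Tokenizers\WhitespaceTokenizer;
use Basset\Tokenizers\WhitespaceAndPunctuationTokenizer;

$text = "Big fish, deadpool!";
$a = (new WhitespaceTokenizer())->tokenize($text);               // roughly: ['Big', 'fish,', 'deadpool!']
$b = (new WhitespaceAndPunctuationTokenizer())->tokenize($text); // roughly: ['Big', 'fish', ',', 'deadpool', '!']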

Metadata is not required at all, but providing it allows you to identify which document is which later on.

Note:
The documentID used will be the document's key offset, i.e. 0 to N where N is the last document added.
When tagging metadata, if two metadata tags are the same, the latter will always overwrite the former.

Similarly, a query is also a form of document in Basset (albeit a shorter one; in IR, a query is often called a "pseudo-document"). You will see in the wrapping-up section later how a query should be set.

$query = new Document(new TokensDocument($tokenizer->tokenize('BIG fish deadpool')));

Transforming Documents

Before indexing documents, different transformations can be applied to pre-process each document.
From tokenization, to canonicalizing tokens, to filtering words that have little value, to removing derivational affixes, to morphological analysis, etc., all with the purpose of simplifying succeeding processes and, in the case of Basset, of alleviating the issue of not giving proper weights/ranks/scores to documents 'that actually matter' for a query.

The following are not within Basset's scope, but pretty much any NLP tools can help:
  • Language-classifier based Tokenizers
  • Language-classifier based Normalizers
  • Filters (Stopword, Date, Email/Domain, Special characters, etc.)
  • Thesauri
  • Lemmatization
  • Stemming
One such toolkit is php-nlp-tools (though you may have to build on top of it to add your own language-based classification and/or custom stuff).

Normalizing documents ensures that each token is 'canonical' and comparable, which gives a better expectation of the result.

Take note that Basset's default (i.e. the English class) simply transforms all tokens to lowercase.


use Basset\Normalizers\English;
use Basset\Utils\TransformationSet;
use Basset\Utils\StopWords;

$stopwords = file_get_contents('../location/for/stopwords.txt');

// Specify Transformations
$transformations = array(
                  new English(),
                  new StopWords($tokenizer->tokenize($stopwords))
                  );

// Initiate TransformationSet
$transform = new TransformationSet();
$transform->register($transformations);

Once instantiated, it can be applied to the IndexWriter (so long as close() hasn't been called):

     $indexWriter->applyTransformation($transform);

The applyTransformation method also accepts a single transformation instead of going through a TransformationSet.

     $indexWriter->applyTransformation(new English());

StopWords can also take its own normalizer; English is the default.

    new StopWords($tokenizer->tokenize($stopwords), new English());

Remember that the same processes applied to documents are expected to be applied to queries, since a query is also a form of document (in Basset at least, if not in the IR field). We cannot search for 'and' if 'and' is filtered out of all documents before indexing.
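
For example, reusing the same $transform set built above on the query keeps both sides consistent (the wrapping-up section below does exactly this):

$query = new Document(new TokensDocument($tokenizer->tokenize('BIG fish deadpool')));
$query->applyTransformation($transform); // same pipeline as the indexed documents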

Indexing Documents

Once the documents are tokenized, added and transformed, the index has to be processed.
From Wikipedia: indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.

IndexWriter processes the index once the close() method is called. It commits the index, counting the collection and writing the counted data to a file (.idx).

The index file aggregates basic statistics for the document collection.

It collects the following two kinds of information, CollectionStatistics and EntryStatistics:

CollectionStatistics

These are the overall statistics for the collection itself.

It has the following methods (replace get with set for setters):
  • getAverageDocumentLength() - Returns average doc length (∑TF/N).
  • getNumberOfDocuments() - Returns number of documents in the collection (N).
  • getNumberOfTokens() - Returns total number of all tokens in the entire collection (∑TF).
  • getNumberOfUniqueTokens() - Returns the total number of unique terms in the collection.
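
To make the definitions concrete, here is a small sketch in plain PHP (independent of Basset's internals) computing these quantities for a toy two-document collection:

$docs = array(
    array('big', 'fish', 'big'),   // 3 tokens
    array('deadpool', 'fish'),     // 2 tokens
);

$numberOfDocuments     = count($docs);                                          // N = 2
$numberOfTokens        = array_sum(array_map('count', $docs));                  // ∑TF = 5
$numberOfUniqueTokens  = count(array_unique(array_merge($docs[0], $docs[1])));  // 3: 'big', 'fish', 'deadpool'
$averageDocumentLength = $numberOfTokens / $numberOfDocuments;                  // ∑TF/N = 2.5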

EntryStatistics

This class provides basic statistics about a lexical entry (a term); we consider all documents added to be the 'lexicon' or 'corpus'.

It has the following methods (replace get with set for setters):
  • getTermFrequency() - Returns the number of occurrences of the term in the entire corpus (TFt).
  • getDocumentFrequency() - Returns the number of documents containing the term in the entire corpus (DFt).
  • getTotalByTermPresence() - Returns the total number of tokens in the only set of documents where the term appears.
  • getUniqueTotalByTermPresence() - Returns the total number of unique terms in the only set of documents where the term appears.
  • getPostingList() - Returns the array of posting lists for the term.
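
Again as a plain-PHP sketch (not Basset's internal code), the entry-level quantities for the term 'fish' in a toy two-document collection would be:

$docs = array(
    0 => array('big', 'fish', 'big'),
    1 => array('deadpool', 'fish'),
);
$term = 'fish';

$postingList = array();                       // docID => tf of the term in that document
foreach ($docs as $docId => $tokens) {
    $tf = count(array_keys($tokens, $term));  // occurrences of 'fish' in this document
    if ($tf > 0) {
        $postingList[$docId] = $tf;
    }
}
$termFrequency     = array_sum($postingList); // TFt = 2 across the whole corpus
$documentFrequency = count($postingList);     // DFt = 2 documents contain 'fish'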

PostingStatistics(DocumentInterface)

This class indexes and represents one posting (a document) in a posting list. Take note that PostingStatistics does not concern itself with the entire corpus' stats (tf here is local, not to be confused with the corpus termFrequency in the EntryStatistics).

It has the following methods (replace get with set for setters):
  • getTf() - Returns the frequency of the term in the current document (tf).

Once the indexing is done, it will look pretty much like this:

id      tf(corpus)  df   ttp    uttp  postingList
'word'  2500        150  16000  4000  array('document1' => 540, 'document2' => 60) ...

In addition, it also collects the metadata given for each document added.

IndexReader

Once the index file is created, IndexReader() takes care of reading it.

If you've used a custom directory path, meaning you've set a different destination folder as IndexWriter's parameter (and optionally a custom file name through setFileName()), you need to specify that path in IndexReader(); otherwise it'll just look for the default file (IndexWriter() without parameters always writes the index at ../index/basset_index.idx).

Pointer: IndexWriter's getLocation() returns the string location of the created index file. Once you've called close() on the IndexWriter and the index is ready for reading, if it happens to reside in a custom directory with a custom name, you can get its location using this method and feed it to the IndexReader, like new IndexReader($indexWriter->getLocation()).

The created index file acts as the persistent data for all documents added to the collection. Once created, you can always reuse the file to avoid re-indexing the same collection at run-time (if you're simply trying out different models for the same collection, e.g., NPL, Medline, TREC, etc.).
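
A minimal sketch of reading an index back, reusing the $indexWriter from the earlier example and only the calls described above:

// Default index (IndexWriter was constructed with no arguments):
$indexReader = new IndexReader();

// Custom directory and/or filename: reuse the writer's own location so the paths always match.
$indexReader = new IndexReader($indexWriter->getLocation());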

What comes next is a "wrapping-up" section for all of these, and we'll come up with....

Search

    
    use Basset\Index\IndexReader;
    use Basset\Index\IndexWriter;
    use Basset\Documents\TokensDocument;
    use Basset\Documents\Document;
    use Basset\Normalizers\English;
    use Basset\Tokenizers\WhitespaceTokenizer;
    use Basset\Utils\StopWords;
    use Basset\Utils\TransformationSet;
    use Basset\MetaData\MetaData;

    // Different Expansion models
    use Basset\Expansion\IdeDecHi;
    use Basset\Expansion\IdeRegular;
    use Basset\Expansion\Rocchio;
    use Basset\Expansion\RelevanceModel;
    use Basset\Expansion\GeneticAlgorithm;
    use Basset\Expansion\DifferentialEvolution;
    use Basset\Expansion\CauchyDE;
    use Basset\Expansion\SelfAdaptiveDE;

    // Make Everything happen here
    use Basset\Search\Search;

    // Make Language Models happy
    use Basset\Models\DirichletLM;

    // Make Probabilistic Models happy
    use Basset\Models\BM25;

    // Make VS Models happy
    use Basset\Models\TfIdf;
    
    // Make DFR Models happy
    use Basset\Models\DFRModel;
    use Basset\Models\Normalization\NormalizationH1;
    use Basset\Models\DFRModels\In;
    use Basset\Models\DFRAfterEffect\B;

    // Make DFI Models happy
    use Basset\Models\DFIModel;
    use Basset\Models\DFIModels\ChiSquared;
    use Basset\Models\Idf;

    // Make IB Models happy
    use Basset\Models\IBModel;
    use Basset\Models\Normalization\NormalizationH2;
    use Basset\Models\IBDistribution\LLDistribution;
    use Basset\Models\IBLambda\LambdaDF;


/*
 *
 * Load the documents and pre-process them. (I use screenplays from imsdb.com as document samples)
 *
 */
    $doc1 = file_get_contents("test.txt");
    $doc2 = file_get_contents('test2.txt');
    $doc3 = file_get_contents('test3.txt');
    $doc4 = file_get_contents('test4.txt');
    $doc5 = file_get_contents('test5.txt');

    $stopwords = file_get_contents('stopwords.txt');
    
    $tokenizer = new WhitespaceTokenizer();
    
    $english = new English();
    $filter  = new StopWords($tokenizer->tokenize($stopwords));
    $stemmer = new RegexStemmer('/(\w+)(?=ly|es|s|y)/'); // custom stemmer example; any class implementing TransformationInterface works

    $transformations = array(
                      $english,
                      $filter,
                      $stemmer
                      );

    $transform = new TransformationSet();
    $transform->register($transformations);

/* 
 * Set the Collection
 */
    $indexWriter = new IndexWriter(); 
    $indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), new MetaData(array('title' => 'Deadpool')));
    $indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc2)), new MetaData(array('title' => 'BigFish')));
    $indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc3)), new MetaData(array('title' => 'American Sniper'))); 
    $indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc4)), new MetaData(array('title' => 'La La Land')));
    $indexWriter->addDocument(new TokensDocument($tokenizer->tokenize($doc5)), new MetaData(array('title' => 'The Revenant')));
    $indexWriter->applyTransformation($transform);
    $indexWriter->close();
/* 
 * Start Reading the Index file 
 * Since I'm using the default path and name, just start IndexReader with no arguments.
 */
   $indexReader = new IndexReader();

/* 
 * Initiate Query
 * All registered pre-processing transformations should fix and normalize this messy query.
 */
    $query = new Document(new TokensDocument($tokenizer->tokenize('bIg bEaR and fIshy, tHen deadpool')));
    $query->applyTransformation($transform);
/* 
 * START SEARCH
 * This is the point where you decide the models and metric to use for scoring.
 */

/* 
 * Set LM Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new DirichletLM);

/* 
 * Set Probabilistic Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new BM25);

/* 
 * Set VSM Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new TfIdf);


/* 
 * Set DFR Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new DFRModel(new In, new B, new NormalizationH1));

/* 
 * Set DFI Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new DFIModel(new ChiSquared, new Idf));

/* 
 * Set IB Ranking models
 */
    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new IBModel(new LLDistribution, new LambdaDF, new NormalizationH1));

/*
 * You can enable QueryExpansion if you wish for a much better recall/precision.
 * All expansion models default to top 20 relevant and top 10 non-relevant docs and querylength + 30 top terms to be used for expansion.
 * They also take their own parameters, so please read each class to see how and where they're derived from.
 * The Evolutionary Algorithm (EA) expansions are an experiment of mine that only takes in the fittest document derived from 100 generations.
 */
    // For VSM and probabilistic models
    $search->setQueryExpansion(new IdeDecHi);
    $search->setQueryExpansion(new Rocchio);
    $search->setQueryExpansion(new IdeRegular);
    $search->setQueryExpansion(new GeneticAlgorithm);
    $search->setQueryExpansion(new DifferentialEvolution);
    $search->setQueryExpansion(new CauchyDE);
    $search->setQueryExpansion(new SelfAdaptiveDE);

    // OR for language models
    $search->setQueryExpansion(new RelevanceModel);

// EVERY search ends with search()
    $results = $search->search(); 

/**
 * search() returns an instance of the ResultSet object, and can take a $limit number and a boolean $descending as
 * parameters to display the results. The more documents you add, the more you'll be overwhelmed by results (the default is search(10, true)).
 * ResultSet exposes the results as an array through getResults(); each result has the docID, the score and the MetaData given for the document.
 * Take note: if you've set the model() to TfIdf but similarity() to a distance metric (see the Metric namespace for metrics),
 * and you want ascending order and the top 15 items, you'll do...
 */
    $results = $search->search(15, false);

/* 
 * As explained above, it returns a ResultSet object, which collects the docId, score and MetaData for each doc.
 * So to display it, you'd have to do...
 */
    $display = array();

    foreach($results->getResults() as $result) {
        $title = $result->getMetaData()->getTag('title'); //getting the title tag from metadata added for the doc earlier.
        $display[$title] = $result->getScore();
    }
        
    print_r($display);

Weighting Models

Each Weighting Model may or may not accept parameters, so please read each model to see which papers they're derived from, which parameters are required (or simply locked and set, like the log base of DFR's Idf, for instance, which always uses the binary log) and which values are set by default. Basset's Search class is the base wrapper for literally everything (the models were initially separated during Basset's pre-release stage, but were eventually centralized): Probabilistic Relevance, Language Models, DFR, DFI, IB and VS Models.

Probabilistic Models

Sample Usage:

$search = new Search(IndexReader);
$search->query(DocumentInterface class);
$search->model(WeightedModelInterface class);
$search->queryModel(WeightedModelInterface class); // not required as query models are instantiated in whatever model you chose above, this is just a way of changing the default.

$search->similarity(SimilarityInterface or DistanceInterface type); // not required as metric measures are instantiated in whatever model you chose above, this is just a way of changing the default.

$search->search(int $limit = 10, boolean $descending = true);

Probabilistic Relevance Models  (Please read the following defaults for query models and metric used)
  • BM25 - Okapi's Best Matching algorithm. TermCount is default for query model and VectorSimilarity is default for metric.
  • ModBM25 - Ensures negative IDF don't violate Term-Frequency, Length Normalization and Tf-length Constraints. TermCount is default for query model and VectorSimilarity is default for metric.
  • BM25L - A work of Lv and Zhai to rewrite BM25 due to Singhal et al's observation for having it penalized longer documents. TermCount is default for query model and VectorSimilarity is default for metric.
  • BM25Plus - Based on Lv and Zhai's work for lower bounding BM25. TermCount is default for query model and VectorSimilarity is default for metric.
  • AtireBM25 - Uses Robertson-Walker IDF instead of the original Robertson-Sparck IDF. TermCount is default for query model and VectorSimilarity is default for metric.
  • BSDS - Implements the Binary Standard Document Score (BSDS) with document length normalization. TermCount is default for query model and VectorSimilarity is default for metric.
Ready-made Divergence-From-Randomness Models  (Please read the following recommendation for query models)
  • XSqrA_M - Inner product of Pearson's X^2 with the information growth computed with the multinomial M. TermCount is default for query model and VectorSimilarity is default for metric.
Language Models (Please read the following recommendation for query models)
  • HiemstraLM - Based on Hiemstra's work. TermCount is default for query model and VectorSimilarity is default for metric.
  • DirichletLM - Bayesian smoothing with Dirichlet Prior. TermFrequency is default for query model and VectorSimilarity is default for metric.
  • JelinekMercerLM - Based on the Jelinek-Mercer smoothing method. TermFrequency is default for query model and VectorSimilarity is default for metric.
  • AbsoluteDiscountingLM - Absolute Discounting smoothing method. TermFrequency is default for query model and VectorSimilarity is default for metric.
  • TwoStageLM - Leave-one-out method. This is also a generalization of both DirichletLM and JelinekMercerLM methods. TermFrequency is default for query model and VectorSimilarity is default for metric.
SPUD Language Models - Based on Cummins et al's work (Please read the following recommendation for query models)
  • DirichletSPUD - Estimation of background DCM via Multivariate Polya Distribution with Bayesian smoothing with Dirichlet Prior. TermCount is default for query model and VectorSimilarity is default for metric. (This is not used at the moment as the way it extracts background model prolongs the indexing stage. Using this throws an exception for now)
  • JelinekMercerSPUD - Estimation of background DCM via Multivariate Polya Distribution with the Jelinek-Mercer smoothing method. TermCount is default for query model and VectorSimilarity is default for metric.
Divergence-From-Independence (DFI) (Please read the following recommendation for query models)
  • IRRA12 - Term weighting model developed on the basis of Shannon’s Information Theory. TermCount is default for query model and VectorSimilarity is default for metric.

Divergence-From-Independence Framework

DFI framework's document model wrapper is DFIModel. TermCount is default for query model and VectorSimilarity is default for metric.
DFI models are obtained by instantiating the two components of the framework:
  1. Selecting a Divergence measure.
  2. Selecting an IDF method.
Take note that IRRA12 is also part of this framework but uses a different expectation measure than the rest, so it's left out with these chaps. (Note: the paper uses the binary log for its Idf)
Usage:
DFIModel(DFIInterface $model, IDFInterface $idf)
Location:
Basset\Models\DFIModel;
Divergence Measures: 
use Basset\Models\DFIModels\xxxxx;
  • ChiSquared
  • Saturated
  • Standardized
IDF Methods (use models that implement IDFInterface):
use Basset\Models\xxxxx;
  • Idf - Using log2(NumberofDocuments/DocumentFrequency) method.
  • IdfOkapi - Using BM25's Idf. The paper used this for better accuracy.
  • IdfDFR - Using DFR Framework's Idf. (this uses binary log)
  • IdfSparckRobertson - Robertson-Sparck's IDF.
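
As a sketch, here is one way to compose a DFI model with the IdfOkapi method recommended by the paper; any other divergence measure or IDFInterface model from the lists above can be swapped in (the $search object is the one from the Search example earlier).

use Basset\Models\DFIModel;
use Basset\Models\DFIModels\Saturated;
use Basset\Models\IdfOkapi;

$search->model(new DFIModel(new Saturated, new IdfOkapi));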

Information-Based Framework

The Information-Based framework's document model wrapper is IBModel. TermCount is default for query model and VectorSimilarity is default for metric.
IB models are obtained by instantiating the three components of the framework:
  1. Selecting a Probabilistic Distribution.
  2. Selecting the method for getting the lambda parameter
  3. Normalising the term frequencies (Normalization based on DFR framework).
Usage:
IBModel(IBDistributionInterface $model, IBLambdaInterface $lambda, NormalizationInterface $normalization)
Location:
Basset\Models\IBModel;
Probabilistic Distributions:
use Basset\Models\IBDistribution\xxxxx;
  • LLDistribution - The log-logistic Distribution.
  • SPLDistribution - A Smoothed Power-Law (SPL) Distribution.
Lambda Parameters:
use Basset\Models\IBLambda\xxxxxxxx
  • LambdaDF - Using average number of documents where term occurs.
  • LambdaTTF - Using the average number of term occurrences in the collection.
Normalization:
use Basset\Models\Normalization\xxxxx;
  • (See DFR Normalization options)
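
A sketch of another IB combination, built only from the components listed above (again reusing the $search object from the earlier example):

use Basset\Models\IBModel;
use Basset\Models\IBDistribution\SPLDistribution;
use Basset\Models\IBLambda\LambdaTTF;
use Basset\Models\Normalization\NormalizationH2;

$search->model(new IBModel(new SPLDistribution, new LambdaTTF, new NormalizationH2));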

Divergence-From-Randomness Framework

The DFR framework's document model wrapper is DFRModel. TermCount is default for query model and VectorSimilarity is default for metric. DFR models are obtained by instantiating the three components of the framework:
  1. Selecting a basic randomness model.
  2. Applying the first normalisation.
  3. Normalising the term frequencies.
Usage:
DFRModel(BasicModelInterface $model, AfterEffectInterface $aftereffect, NormalizationInterface $normalization)
Location:
Basset\Models\DFRModel;
Basic Models:
use Basset\Models\DFRModels\xxxxx;
  • P - Approximation of the binomial.
  • BE - Bose-Einstein distribution.
  • G - Geometric approximation of the Bose-Einstein.
  • In - Inverse Document Frequency model.
  • InFreq - Inverse Term Frequency model.
  • InExp - Inverse Expected Document Frequency model.
After Effect: use Basset\Models\DFRAfterEffect\xxxxx;
  • L - Laplace’s law of succession.
  • B - Ratio of two Bernoulli processes.
Normalization: use Basset\Models\Normalization\xxxxx;
  • NormalizationH1 - Uniform distribution of the term frequency.
  • NormalizationH2 - The term frequency density is inversely related to the length.
  • NormalizationH2E - We use natural log instead of log2.
  • NormalizationBM25 - BM25's normalization.
  • NormalizationDP - Bayesian smoothing with Dirichlet Priors normalization.
  • NormalizationJMDF - Jelinek-Mercer smoothing normalization using docfrequency.
  • NormalizationJMTF - Jelinek-Mercer smoothing  normalization using termfrequency.
  • NormalizationP - Term Frequency Normalisation via Pareto Distributions.
  • NormalizationF - Using increasing density function for the frequency normalisation.
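
A sketch of another DFR combination, built only from the components listed above (reusing the $search object from the earlier example):

use Basset\Models\DFRModel;
use Basset\Models\DFRModels\BE;
use Basset\Models\DFRAfterEffect\L;
use Basset\Models\Normalization\NormalizationH2;

$search->model(new DFRModel(new BE, new L, new NormalizationH2));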

Algebraic Calculations

Sample Usage:

$search = new Search(IndexReader);
$search->query(DocumentInterface class);
$search->model(WeightedModelInterface class);

$search->queryModel(WeightedModelInterface class); // not required and will never be used, as the query model will always be the same as the model you chose above.

$search->similarity(SimilarityInterface or DistanceInterface type); // not required as a metric measure is instantiated by whatever model you chose above; this is just a way of changing the default (and it has to be a VSMInterface type or it will throw an error).

$search->search(int $limit = 10, boolean $descending = true);


Algebraic Models in Basset represent the documents and the weighting methods (like tf-idf models, for BOTH query and document) in vector form. Unlike the previous models (which only score each term if it's also present in the query, i.e. a vector product), they focus on vector operations such as the deviation of angles between the query and document representations, set operations and so on. Whatever weighting model is used to represent terms in the documents will also be used on the query.

Take note also that all calculations are smoothed to allow for 1-vs-1 document checking (idf simply gives log(1/1) == 0 for a single-document corpus, and we don't want that if we just want to compare a single document).

Presently, you can use any form of weighting model, but in a classical fashion, any simple TfIdf type is all you need:
  • TfIdf
  • PivotedTfIdf
  • LemurTfIdf
  • PivotedConcaveTFIDF

Similarity/Distance metrics:


Models like the ones mentioned above have a default similarity metric of CosineSimilarity; this can be changed through the similarity() method. It's a rule that whenever the metric used is of VSMInterface type (i.e. anything but VectorSimilarity, the only similarity not of VSMInterface type), both the query and document models must be the same (there's no point in getting a similarity between documents that aren't similarly represented or weighted; a weight represents the term in relation to the entire collection, and the weights are used as identifiers while being measured).

Pointer: similarity() replaces the default similarity set in the models.
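
A sketch of the scenario described in the earlier comment (TfIdf as the model, a distance metric swapped in through similarity(), results in ascending order), reusing the $indexReader and $query from the Search example and assuming EuclideanDistance lives in the Basset\Metric namespace like the other metrics:

use Basset\Models\TfIdf;
use Basset\Metric\EuclideanDistance;

$search = new Search($indexReader);
$search->query($query);
$search->model(new TfIdf);
$search->similarity(new EuclideanDistance); // replaces the default CosineSimilarity
$results = $search->search(15, false);      // top 15, ascending: a smaller distance is a better match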

You can create your own, as there is a whole family of notations for these (see the SMART system).
Once vectorized, the similarity or distance between the query and document vectors can be measured with any of the options listed below:

Similarity/Distance Types (SimilarityInterface and DistanceInterface both extend MetricInterface):

Each class has references on how it's derived.

You can implement your own as long as it implements the proper interface (SimilarityInterface or DistanceInterface AND VSMInterface, as they should not be used in probabilistic retrieval models).

There is a family of these measures classified for specific purposes. You can evaluate each one with an evaluation method of your own, as Basset currently has none.

Keep in mind that a distance should always be the reverse of a similarity (a similarity converges to 1 as the document matches the query and to 0 otherwise), but some measures that appear to be distance metrics actually converge to 1 or even greater, as some aren't even bounded between 0 and 1.
Such divergences should be taken into consideration when using these methods, as mixing them up may impact your business logic. (It is important to read the references surrounding these measures before actually using them.)

  • CosineSimilarity - Vectors computed thru Cosine similarity.
  • SqrtCosineSimilarity - Square-root Cosine implementation using HellingerDistance norm.
  • TriangleSectorSimilarity - A Hybrid Geometric Approach for cosine similarity.
  • DiceSimilarity - Be wary: set operations are best used with a larger collection (i.e. a wide spectrum of TfIdf values representing each term), as a smaller collection may be exposed to duplicate values, especially for frequently occurring terms, i.e. stop words.
  • JaccardIndex
  • TverskyIndex - A Generalization of JaccardIndex and DiceSimilarity. TverskyIndex accepts parameters, please take a look at the class.
  • KLDivergence - Kullback-Leibler Divergence.
  • JSDivergence - JensenShannon Divergence.
  • RenyiDivergence - A generalization of both KL and JS divergence.
  • KulczynskiDistance
  • OverlapCoefficient
  • BrayCurtisDistance
  • CanberraDistance
  • ChebyshevDistance
  • ChiSquareDistance
  • HellingerDistance
  • LorentzianDistance
  • MatusitaDistance
  • MotykaSimilarity
  • RuzickaSimilarity
  • SoergleDistance
  • StamatatosDistance
  • EuclideanDistance
  • CzekanowskiSimilarity


Using the Cranfield Collection and the sample.php file

The Cranfield Collection has been the pioneering collection in information retrieval for validating a system's effectiveness.

I've included the 1400-abstract Cranfield Collection as an XML file that you can parse into separate files. (There is one in the Cranfield folder, and it's used in the sample.php file.)

The test file at tests/sample.php can be executed right away to do the parsing and run a search for a single test query (the default is set to ModBM25).

Customize it to your needs.

You can read Cranfield/cranfield-collection/cranqrel for Glasgow's qrels results.

I've also included SMART system's stopword list for standardization (see stopwords/stopwords.txt).

This is what's currently in sample.php:



include_once(__DIR__.'/../autoload.php');
include_once(__DIR__.'/../Cranfield/cranfield_parser.php');

use Basset\Documents\Document;
use Basset\Documents\TokensDocument;
use Basset\MetaData\MetaData;

use Basset\Search\Search;
use Basset\Models\TfIdf;
use Basset\Models\ModBM25;
use Basset\Metric\SqrtCosineSimilarity;

use Basset\Normalizers\English;
use Basset\Tokenizers\WhitespaceAndPunctuationTokenizer;
use Basset\Utils\StopWords;
use Basset\Utils\TransformationSet;

use Basset\Index\IndexReader;
use Basset\Index\IndexWriter;

use Basset\Models\DFIModel;
use Basset\Models\DFIModels\ChiSquared;
use Basset\Models\Idf;

use Basset\Expansion\IdeDecHi;


class Similarity {

    public function test() {
    $start = microtime(true);
    // Initialized required stuff.

    // THE DOCUMENTS
    $cranfield = new CranfieldParser(__DIR__.'/../Cranfield/cranfield-collection/cran.all.1400.xml-format.xml');
    $documents = $cranfield->parse(); 

    // This is a set of NLP stuff used to analyze each token (term) in a given Document.
    $stopwords = file_get_contents(__DIR__.'/../stopwords/stopwords.txt');
    $tokenizer = new WhitespaceAndPunctuationTokenizer;

    $pipeline = array(
                new StopWords($tokenizer->tokenize($stopwords)),
                new English,
                // also stemmer if you have any, as I don't have any. Make sure it implements TransformationInterface
                );
    $transform = new TransformationSet;
    $transform->register($pipeline);

        /** 
         *
         * Start indexing files.
         * IndexWriter() takes an optional directory path, and it will create an index folder with an .idx file inside.
         * 
         * Everything is commited to disk once close() is called, otherwise you can keep adding document/s.
         * The file created is Basset's inverted index file.
         *
         * Once created, IndexReader() takes care of reading it.
         *
         * If a custom directory path is created (and an optional custom file name thru setFileName()), you need 
         * to specify the path in IndexReader(), otherwise it'll just look for a default file (index/basset_index.idx).
         *
         * The created index file acts as the persistent data for all documents added in the collection. Once created,
         * you can comment out the code below to avoid re-indexing the same collection during run-time (if you're
         * simply trying out different models for the same collection, e.g., NPL, Medline, TREC, etc.).
         *
         */

    $index = new IndexWriter(__DIR__.'/../custom_index');
    $index->setFileName('mycustomindex');
    foreach($documents as $title => $body){
       $index->addDocument(new TokensDocument($tokenizer->tokenize($body)), new MetaData(array('title' => $title)));
    }
    $index->applyTransformation($transform);
    $index->close();

    // MetaData class is a wrapper for assigning any array of info for a given doc, be it a title, path or a url, etc.

    /** 
     * Dumping $index->getLocation() gives '../custom_index/mycustomindex.idx' which should be fed as parameter
     * for IndexReader.
     */

    // prepare one query as Document instance from Cranfield/cranfield-collection/cran.qry.xml-format
    $query = new Document(new TokensDocument($tokenizer->tokenize('what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft.')));
    $query->applyTransformation($transform);


    /**
     *
     * Start search.
     *
     * There have been changes in class names and operations since the v1 release (to accommodate
     * structural changes).
     *
     * DocumentRanking became Search (mostly working as a manager for everything) and requires an IndexReader
     * instance.
     *
     * Weighting Models are set thru model(), where the weighting model used for the query and the metric for 
     * comparing the query against the documents are explicitly set.
     * You can still change them thru queryModel() and similarity(), and the info regarding the defaults are 
     * given from the docs at
     * https://myth-of-sissyphus.blogspot.com/2018/02/basset-information-retrieval-library-in.html
     * 
     */

    $indexReader = new IndexReader(__DIR__.'/../custom_index/mycustomindex.idx'); // read the custom index specified above

    $search = new Search($indexReader);
    $search->query($query);
    $search->model(new ModBM25);
    $search->setQueryExpansion(new IdeDecHi); //all expansion models default to top 10 relevant and non-relevant docs and querylength + 100 top terms to be used for expansion.
    $results = $search->search(15); // default: 10

    $display = array();

    foreach($results->getResults() as $key => $result) {
        $title = $result->getMetaData()->getTag('title'); //getting the title tag from metadata added for the doc.
        $display[$title] = $result->getScore();
    }
        
    print_r($display); // top K docs
    print_r(microtime(true) - $start . "\xA");

 /**
  * search() returns an instance of ResultSet in descending order, and can take a $limit number and boolean $descending as 
  * parameter to display stuff, as 1400 items is a lot of stuff (default is search(10, true)).
  * ResultSet displays result as array thru getResults(), it has docID, score and the given MetaData for the document.
  */

    }

}

// parse Cranfield xml first before getting relevance

$sim = new Similarity;
$sim->test();

TO-DO:
  • Complete coverage of tests.
  • Proximity ranking (MarkovRandomFields) - might review indexing to allow blocks.
  • Allow index updates and deletion.


P.S. This is a library. It can only help with your own solutions, or with your research. You can use each class separately and expand on it, do experiments and stuff, and if you find this helpful and wish to help, feel free to submit PRs. :)
