
Lucene


Estimating Memory and Storage for Lucene/Solr.

Ll01.nla.gov.au/docs/LuceneRRNotes.html. Set of rules for Lucene relevance ranking. Lucene produces a "score" for each record, based on the following:

Scores embedded in the indexes: Within the title index, query matches in the 245 $a and $b get a score of 7, while other title fields (e.g. series title, added entry title) get a score of 1. In author fields, query matches in the 100 field get a score of 4, while added entry authors get a score of 1.

Scores based on which index is matched, and how close that match is: Every match of the keyword as entered gets a score of 0.5, and a match of the stemmed version of what was entered gets a score of 0.05.

Other scores: The record is allocated a score based on the square root of the number of holdings contained within it.
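The rule set above can be tabulated in a few lines of Python. This is only a sketch: the weights come from the notes, but the function name `score_components`, the field keys, and the dictionary layout are hypothetical, and the components are deliberately left uncombined because the notes do not spell out how Lucene merges them.

```python
import math

# Weights quoted in the notes; the key names are illustrative assumptions.
TITLE_WEIGHTS = {"245a": 7, "245b": 7, "other_title": 1}
AUTHOR_WEIGHTS = {"100": 4, "added_entry": 1}
EXACT_MATCH_SCORE = 0.5     # keyword matched as entered
STEMMED_MATCH_SCORE = 0.05  # only the stemmed form matched


def score_components(title_field, author_field, exact_match, holdings):
    """Return the individual score components for one hypothetical record.

    The components are reported separately; no attempt is made to
    reproduce how Lucene combines them.
    """
    return {
        "title": TITLE_WEIGHTS[title_field],
        "author": AUTHOR_WEIGHTS[author_field],
        "match": EXACT_MATCH_SCORE if exact_match else STEMMED_MATCH_SCORE,
        "holdings": math.sqrt(holdings),
    }


if __name__ == "__main__":
    # A record matched in 245 $a and the 100 field, held by 9 libraries.
    print(score_components("245a", "100", exact_match=True, holdings=9))
```

Keeping the components apart makes it easy to see which rule dominates for a given record before any multiplication is applied.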

Lucene's default scoring (which we can change). How all this fits together, a guide to users: Lucene doesn't allocate separate scores for each of these and then add them all up; instead it uses some as a multiplication factor on others.

Similarity (Lucene 2.9.0 API). java.lang.Object, org.apache.lucene.search.Similarity. All Implemented Interfaces: Serializable. Direct Known Subclasses: DefaultSimilarity, SimilarityDelegator. public abstract class Similarity extends Object implements Serializable. Expert: Scoring API. Similarity defines the components of Lucene scoring. Suggested reading: Introduction To Information Retrieval, Chapter 6. The following describes how Lucene scoring evolves from underlying information retrieval models to (efficient) implementation.

Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval: documents "approved" by BM are scored by VSM. In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension and the weights are Tf-idf values. VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene uses Tf-idf.
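The two-stage idea (Boolean filter first, vector-space scoring second) can be illustrated with a small Python sketch. It uses a textbook tf-idf weighting and cosine similarity, not Lucene's actual formula, and the helper names `tfidf_vectors`, `cosine` and `search`, as well as the sample documents, are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Textbook smoothed idf; Lucene's own idf differs in detail.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(docs, query_terms):
    """Boolean AND decides which documents qualify; cosine ranks them."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(query_terms).items()}
    hits = []
    for doc_id, (doc, vec) in enumerate(zip(docs, vectors)):
        if all(term in doc for term in query_terms):  # "approved" by BM
            hits.append((doc_id, cosine(query, vec)))  # scored by VSM
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    docs = [
        ["lucene", "scoring", "lucene"],
        ["solr", "memory"],
        ["lucene", "index", "storage", "notes"],
    ]
    for doc_id, score in search(docs, ["lucene"]):
        print(doc_id, round(score, 3))
```

The document that never mentions the query term is excluded by the Boolean step and so is never scored, which mirrors the division of labour described above.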
