Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts. [1] LSI is also an application of correspondence analysis, a multivariate statistical technique developed by Jean-Paul Benzécri [2] in the early 1970s, to a contingency table built from word counts in documents. Called latent semantic indexing because of its ability to correlate semantically related terms that are latent in a collection of text, it was first applied to text at Bell Laboratories in the late 1980s.
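The core mechanism is a low-rank SVD of a term-document matrix. Here is a minimal sketch with NumPy, using a tiny hypothetical term-document count matrix (the terms and documents are illustrative, not from any real corpus):

```python
import numpy as np

# Tiny hypothetical term-document count matrix.
# Rows = terms ("ship", "boat", "ocean", "wood", "tree"); columns = documents.
A = np.array([
    [1, 0, 1, 0, 0],  # ship
    [0, 1, 0, 0, 0],  # boat
    [1, 1, 0, 0, 0],  # ocean
    [1, 0, 0, 1, 1],  # wood
    [0, 0, 0, 1, 0],  # tree
], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt,
# with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values. This projects terms and
# documents into a k-dimensional "concept" space, merging terms that
# co-occur in similar contexts.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the Frobenius norm
# (Eckart-Young theorem); queries and documents are compared in this
# reduced space rather than on raw term counts.
print(np.round(A_k, 2))
```

In a real system the counts would typically be weighted (e.g. with tf–idf) before the SVD, and similarity between a query and documents would be computed with cosine similarity in the reduced space.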
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic.
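A minimal sketch of the stochastic approach is a unigram tagger: assign each word the tag it carried most often in a hand-tagged training corpus. The tiny corpus and tag names below are hypothetical, and real taggers condition on context (e.g. hidden Markov models) rather than on the word alone:

```python
from collections import Counter, defaultdict

# Tiny hypothetical hand-tagged training corpus (word, tag) pairs.
tagged_corpus = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("runs", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
]

# Count how often each word appears with each tag.
counts = defaultdict(Counter)
for word, tag_label in tagged_corpus:
    counts[word][tag_label] += 1

def tag(word):
    """Return the most frequent training tag for a word.

    Unseen words fall back to NOUN, a common naive default.
    """
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"

print([(w, tag(w)) for w in "the dog runs".split()])
```

This ignores context entirely, so it cannot disambiguate words like "runs" (verb vs. noun); rule-based taggers add hand-written or learned contextual rules on top of exactly this kind of baseline.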
Tf–idf, short for term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf–idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-word filtering in various subject fields, including text summarization and classification. [1] One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
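The weighting described above can be sketched directly from its definition. This minimal example uses one common variant (length-normalized term frequency and a plain logarithmic idf; other variants smooth or scale these differently) on a hypothetical three-document corpus:

```python
import math

# Hypothetical corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats are pets".split(),
]

def tf_idf(term, doc, docs):
    # Term frequency: raw count in the document, normalized by its length.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: log of (total documents / documents
    # containing the term). Rare terms get a large idf; ubiquitous terms
    # get an idf near zero.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# "the" occurs in most documents, so its weight is damped;
# "cat" is rarer across the corpus, so it scores higher.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

Summing `tf_idf(term, doc, docs)` over every query term gives the simple ranking function mentioned above; the score is high only for terms that are frequent in the document yet rare in the corpus, which is also why high-idf stop words never exist: stop words like "the" are naturally suppressed.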