Latent Semantic Analysis (LSA) Tutorial. Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), means analyzing documents to find the underlying meaning or concepts of those documents.
If each word meant only one concept, and each concept were described by only one word, LSA would be easy, since there would be a simple mapping from words to concepts. Unfortunately, the problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts, to the point where even people can have a hard time understanding them. For example, the word bank used together with mortgage, loans, and rates probably means a financial institution, while bank used together with lures, casting, and fish probably means the bank of a stream or river. Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. To see a small example of LSA, take a look at the next section.

PhD work - Overview. The Symbol Grounding Problem indicates that a subset of a vocabulary must be grounded in the real, physical world in order for the words to have meaning in one's mind.
But once words have been grounded in this way, how can they develop into a full vocabulary? Looking at dictionaries that use controlled vocabularies to define all the words within them (every word used in the definitions comes from a specified subset of the dictionary) can give some idea of how new words can effectively be grounded using a small set of pre-grounded terms. Two controlled-vocabulary dictionaries have been used: the Longman Dictionary of Contemporary English (LDOCE) and the Cambridge International Dictionary of English (CIDE). Both have controlled vocabularies of around 2,000 words.
Using these dictionaries, a number of questions can be answered.

A comparison of LSA, WordNet and PMI-IR for predicting user click behavior. A predictive tool that simulates human visual search behavior would help interface designers inform and validate their designs.
Such a tool would benefit from a semantic component that could help predict search behavior even in the absence of exact textual matches between goal and target. This paper compares three semantic systems (LSA, WordNet, and PMI-IR) to evaluate how well each predicts the link that people would select given an information goal and a webpage. PMI-IR best predicted human performance as observed in a user study.
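PMI-IR scores how strongly two words are associated using pointwise mutual information, with the probabilities estimated from information-retrieval hit counts. A minimal sketch of the PMI computation itself, using made-up counts from a toy corpus in place of search-engine hits:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    PMI-IR estimates these probabilities from search-engine hit
    counts; here they come from hypothetical document counts.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: of 1,000 documents, "bank" appears in 100,
# "mortgage" in 50, and both appear together in 40.
score = pmi(count_xy=40, count_x=100, count_y=50, total=1000)
print(score)  # -> 3.0; positive PMI means the terms co-occur more than chance
```

A score well above 0 indicates the two words co-occur far more often than independence would predict, which is the signal PMI-IR uses to rank candidate links.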
Latent semantic analysis. Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix of word counts per paragraph (rows represent unique words, columns represent paragraphs) is constructed from a large body of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among the rows. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows.
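The count-matrix, SVD, and cosine steps can be sketched in a few lines. The corpus below is hypothetical (five terms, five tiny documents), and the rank k=2 is chosen just for illustration:

```python
import numpy as np

# Toy term-document count matrix (hypothetical corpus):
# rows = terms, columns = documents/paragraphs.
terms = ["bank", "mortgage", "loan", "river", "fish"]
X = np.array([
    [2, 1, 0, 1, 0],   # "bank" appears in both senses
    [1, 2, 0, 0, 0],   # mortgage
    [1, 1, 0, 0, 0],   # loan
    [0, 0, 1, 2, 1],   # river
    [0, 0, 2, 1, 1],   # fish
], dtype=float)

# SVD, then keep only the top-k singular values (rank lowering).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # each row is a term in the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

idx = {t: n for n, t in enumerate(terms)}
print(cosine(term_vecs[idx["mortgage"]], term_vecs[idx["loan"]]))  # high
print(cosine(term_vecs[idx["mortgage"]], term_vecs[idx["fish"]]))  # low
```

With this block structure, the two latent dimensions roughly separate the "financial" documents from the "river" documents, so mortgage/loan end up with a high cosine and mortgage/fish with a low one.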
Values close to 1 indicate very similar words, while values close to 0 indicate very dissimilar words.

Semanticvectors - Project Hosting on Google Code. The Semantic Vectors Package. SemanticVectors creates semantic WordSpace models from free natural language text.
Such models are designed to represent words and documents in terms of underlying concepts. They can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching. These are described more thoroughly in the UseCases page.

Getting Started. See GettingStarted and the maven semanticvectors artifact.

Algorithms and Techniques. The models are created by applying concept mapping algorithms to term-document matrices created using Apache Lucene.
Random Projection is the most scalable technique in practice, because it does not rely on computationally intensive matrix decomposition algorithms. Links to more RelatedResearch and lots of other topics can be found in the Wiki pages.

Contributors and Projects. There are many ways to get involved, as an end user or a contributor.

Issues and Bugs.
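SemanticVectors builds its random index vectors on top of Lucene, but the core idea behind Random Projection can be shown generically: multiply the term-document matrix by a random matrix instead of decomposing it. A sketch with made-up dimensions, assuming NumPy (this is not the package's own implementation, which uses sparse ternary index vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-dimensional term-document count matrix:
# 1,000 terms counted over 5,000 documents.
n_terms, n_docs, k = 1000, 5000, 200
X = rng.poisson(0.05, size=(n_terms, n_docs)).astype(float)

# Random projection: a random +/-1 matrix scaled by 1/sqrt(k).
# By the Johnson-Lindenstrauss lemma, pairwise distances between
# term rows are approximately preserved with high probability,
# at a cost of one matrix multiply rather than a full SVD.
R = rng.choice([-1.0, 1.0], size=(n_docs, k)) / np.sqrt(k)
term_vecs = X @ R          # shape (n_terms, k)

print(term_vecs.shape)     # (1000, 200)
```

The projection reduces each term from a 5,000-dimensional document-count vector to 200 dimensions in a single pass, which is why the approach scales to corpora where SVD is impractical.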