Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix of word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.[1]

http://en.wikipedia.org/wiki/Latent_semantic_analysis
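
As a rough illustration of the procedure summarized in the Latent semantic analysis entry above, the following Python sketch builds a word-by-paragraph count matrix, truncates its SVD, and compares words by cosine. The toy corpus, the choice of k = 2, and the helper names (cosine, similarity) are assumptions made for this example and do not come from the source; NumPy is assumed to be installed.

```python
import numpy as np

# Hypothetical toy corpus: each string stands in for one "paragraph".
paragraphs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]

# One row per unique word, one column per paragraph.
vocab = sorted({word for p in paragraphs for word in p.split()})
row = {word: i for i, word in enumerate(vocab)}

counts = np.zeros((len(vocab), len(paragraphs)))
for j, p in enumerate(paragraphs):
    for word in p.split():
        counts[row[word], j] += 1

# Thin SVD: counts = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep only the k largest singular values (k is an assumed setting).
k = 2
word_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per word


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity(w1, w2):
    """Cosine similarity of two words in the reduced space."""
    return cosine(word_vectors[row[w1]], word_vectors[row[w2]])


print("cat / dog:   ", round(similarity("cat", "dog"), 3))
print("cat / stocks:", round(similarity("cat", "stocks"), 3))
```

On this toy corpus, "cat" and "dog" end up with a cosine close to 1 and "cat" and "stocks" with a cosine close to 0, even though "dog" appears in only one of the two paragraphs that contain "cat". Results on real text depend heavily on corpus size, term weighting, and the number of retained dimensions.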

Related:  POLYSEMY AND SYNONYMY

Polysemy
Charles Fillmore and Beryl Atkins’ definition stipulates three elements: (i) the various senses of a polysemous word have a central origin, (ii) the links between these senses form a network, and (iii) understanding the ‘inner’ one contributes to understanding of the ‘outer’ one.[3] Polysemy is a pivotal concept within disciplines such as media studies and linguistics. A polyseme is a word or phrase with different, but related senses. Since the test for polysemy is the vague concept of relatedness, judgments of polysemy can be difficult to make.

semanticvectors - Project Hosting on Google Code
The Semantic Vectors Package: SemanticVectors creates semantic WordSpace models from free natural language text. Such models are designed to represent words and documents in terms of underlying concepts. They can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching.

Singular value decomposition
Visualization of the SVD of a two-dimensional, real shearing matrix M. First, we see the unit disk in blue together with the two canonical unit vectors. We then see the action of M, which distorts the disk to an ellipse. The SVD decomposes M into three simple transformations: an initial rotation V*, a scaling Σ along the coordinate axes, and a final rotation U. The lengths σ1 and σ2 of the semi-axes of the ellipse are the singular values of M, namely Σ1,1 and Σ2,2.

Sentiment analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).
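
Returning to the singular value decomposition entry above: the factorization of a shearing matrix M into U, Σ and V* can be checked numerically. This is a minimal sketch assuming NumPy is installed; the particular entries of M are an assumption, since the source does not state which shear its figure uses.

```python
import numpy as np

# Assumed example of a two-dimensional real shearing matrix
# (the source does not give the entries of M).
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# numpy returns U, the singular values in descending order, and V* (here Vh).
U, sigma, Vh = np.linalg.svd(M)

# Applying V*, then the axis-aligned scaling, then U reproduces M.
reconstructed = U @ np.diag(sigma) @ Vh

# U and Vh are orthogonal: rotations, possibly combined with a reflection.
u_is_orthogonal = np.allclose(U.T @ U, np.eye(2))
vh_is_orthogonal = np.allclose(Vh @ Vh.T, np.eye(2))

print("singular values (semi-axis lengths):", sigma)
print("M recovered from its factors:", np.allclose(reconstructed, M))
print("U and V* orthogonal:", u_is_orthogonal and vh_is_orthogonal)
```

For this particular M the two singular values multiply to 1, which matches the fact that a shear preserves area: the unit disk is mapped to an ellipse of the same area.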

PhD work - Overview
The Symbol Grounding Problem indicates that a subset of a vocabulary must be grounded in the real, physical world in order for the words to have meaning in one's mind. But once words have been grounded in this way, how can they develop into a full vocabulary? Dictionaries that use a controlled vocabulary to define all of their entries (every word used in the definitions comes from a specified subset of the dictionary) could give some idea of how new words can effectively be grounded using a small set of pre-grounded terms. Two controlled-vocabulary dictionaries have been used: the Longman Dictionary of Contemporary English (LDOCE) and the Cambridge International Dictionary of English (CIDE).

Synonym
In the figurative sense, two words are sometimes said to be synonymous if they have the same connotation: ...a widespread impression that ... Hollywood was synonymous with immorality...[2] Synonyms can be any part of speech (such as nouns, verbs, adjectives, adverbs or prepositions), as long as both words belong to the same part of speech.

Latent Semantic Analysis (LSA) Tutorial
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word meant only one concept, and each concept was described by only one word, then LSA would be easy, since there would be a simple mapping from words to concepts. Unfortunately, the problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.

About the Chi-Square Test
Generally speaking, the chi-square test is a statistical test used to examine differences with categorical variables. There are a number of features of the social world we characterize through categorical variables, such as religion and political preference. To examine hypotheses using such variables, use the chi-square test. The chi-square test is used in two similar but distinct circumstances: for estimating how closely an observed distribution matches an expected distribution (referred to here as the goodness-of-fit test), and for estimating whether two random variables are independent.
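
The goodness-of-fit use mentioned above can be sketched with Pearson's statistic, the sum of (observed - expected)^2 / expected over all categories. The die-roll counts and the function name below are hypothetical and chosen only for illustration; the source gives no data.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 120 rolls of a six-sided die.
observed = [18, 22, 20, 25, 15, 20]
# A fair die would give 20 of each face on average.
expected = [20] * 6

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

print("chi-square statistic:", round(statistic, 3))
print("degrees of freedom:", degrees_of_freedom)
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom. With 5 degrees of freedom the usual 5% critical value is about 11.07, so a statistic well below that gives no grounds to reject the fair-die hypothesis for these made-up counts.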

GATE Developer
GATE Developer is a development environment that provides a rich set of graphical interactive tools for the creation, measurement and maintenance of software components for processing human language. GATE Developer is open source software, available under the GNU Lesser General Public Licence 3.0, and can be downloaded from this page. (GATE Developer and GATE Embedded are bundled, and in older distributions were referred to just as "GATE".)
