
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. A consequence of the rank lowering is that some dimensions are combined and depend on more than one term.
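
As a minimal sketch of the pipeline just described, the following builds a toy term-by-paragraph count matrix and applies SVD to lower its rank; the matrix contents and the choice of rank k = 2 are invented for illustration, not data from the text.

```python
import numpy as np

# Toy term-by-paragraph count matrix: rows are unique words,
# columns are paragraphs. The counts are invented for illustration.
terms = ["human", "interface", "computer", "user", "tree", "graph"]
X = np.array([
    [1, 1, 0, 0],   # human
    [1, 0, 0, 0],   # interface
    [1, 1, 1, 0],   # computer
    [0, 0, 1, 0],   # user
    [0, 0, 0, 2],   # tree
    [0, 0, 1, 1],   # graph
], dtype=float)

# SVD, then keep only the k largest singular values: the rank lowering.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # each row represents a term in the reduced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "human" and "user" never share a paragraph above, yet the reduced
# space can still assign them a nonzero similarity through "computer".
print(cosine(term_vecs[terms.index("human")], term_vecs[terms.index("user")]))
```

This is the effect the rank lowering is meant to produce: terms that never co-occur directly can still end up close together in the reduced space.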

Singular value decomposition

[Figure: visualization of the SVD of a two-dimensional real shearing matrix M. The unit disc, shown with the two canonical unit vectors, is distorted by M into an ellipse; the SVD decomposes M into three simple transformations: an initial rotation V*, a scaling Σ along the coordinate axes, and a final rotation U.]

Formally, the singular value decomposition of an m×n real or complex matrix M is a factorization of the form M = UΣV*, where U is an m×m real or complex unitary matrix, Σ is an m×n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and V* (the conjugate transpose of V, or simply the transpose of V if V is real) is an n×n real or complex unitary matrix. The diagonal entries of Σ are known as the singular values of M. The singular value decomposition and the eigendecomposition are closely related.
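
A short numerical check of this factorization (the specific 2×2 shear matrix is an assumption chosen to echo the figure; NumPy returns V* directly as Vt):

```python
import numpy as np

# SVD of a two-dimensional real shearing matrix: M = U @ diag(s) @ Vt.
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])   # shear along the x-axis

U, s, Vt = np.linalg.svd(M)

print("singular values:", s)                  # nonnegative, decreasing
print(np.allclose(U @ np.diag(s) @ Vt, M))    # True: the factorization holds
print(np.allclose(U.T @ U, np.eye(2)),        # U and V are unitary
      np.allclose(Vt @ Vt.T, np.eye(2)))      # (orthogonal, since M is real)
```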

Sentiment analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader). A basic task in sentiment analysis[1] is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
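
As a bare-bones illustration of document-level polarity classification, here is a lexicon-based sketch; the word lists are invented stand-ins, not a real sentiment lexicon, and practical systems use far richer methods and features.

```python
# Minimal lexicon-based polarity sketch. The word lists below are
# illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def polarity(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# One positive and one negative cue cancel out to "neutral".
print(polarity("The plot was great but the acting was terrible"))
```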

PhD work - Overview

The Symbol Grounding Problem indicates that a subset of a vocabulary must be grounded in the real, physical world in order for the words to have meaning in one's mind. But when words have been grounded in this way, how can they develop into a full vocabulary? Looking at dictionaries which use controlled vocabularies to define all the words within them (all words used in the definitions are from a specified subset of the dictionary) could give some idea as to how new words can effectively be grounded by using a small set of pre-grounded terms. Two controlled-vocabulary dictionaries have been used: the Longman Dictionary of Contemporary English (LDOCE) and the Cambridge International Dictionary of English (CIDE). Using these dictionaries, a number of questions can be answered.
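
To make the controlled-vocabulary idea concrete, here is a toy sketch that grounds a new word once every word in its definition is already grounded; the mini-dictionary and the pre-grounded set are invented for illustration and are not LDOCE or CIDE data.

```python
# Toy dictionary: each word maps to the words used in its definition.
# Invented for illustration; not LDOCE or CIDE data.
dictionary = {
    "run":    ["move", "fast"],
    "sprint": ["run", "very", "fast"],
    "jog":    ["run", "slow"],
}
grounded = {"move", "fast", "very", "slow"}   # assumed pre-grounded words

# Fixed-point iteration: ground any word whose whole definition is grounded.
changed = True
while changed:
    changed = False
    for word, definition in dictionary.items():
        if word not in grounded and all(w in grounded for w in definition):
            grounded.add(word)
            changed = True

print(sorted(grounded))   # "run" grounds first; "sprint" and "jog" follow
```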

QMSS e-Lessons | About the Chi-Square Test

Generally speaking, the chi-square test is a statistical test used to examine differences with categorical variables. There are a number of features of the social world we characterize through categorical variables: religion, political preference, etc. To examine hypotheses using such variables, use the chi-square test. The chi-square test is used in two similar but distinct circumstances: for estimating how closely an observed distribution matches an expected distribution (we'll refer to this as the goodness-of-fit test), and for estimating whether two random variables are independent (the test of independence).

The Goodness-of-Fit Test

One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as cards, dice, and roulette. So how can the goodness-of-fit test be used to examine cheating in gambling?
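
As a minimal sketch of the goodness-of-fit test applied to dice (the roll counts below are invented for illustration), SciPy's chisquare compares observed counts against the counts a fair die would produce:

```python
from scipy.stats import chisquare

# 120 rolls of a die: a fair die is expected to show each face 20 times.
# The observed counts are invented for illustration.
observed = [15, 21, 18, 16, 20, 30]
expected = [20] * 6

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
# A p-value below the chosen significance level would suggest the die
# deviates from fairness by more than chance alone explains.
```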

twitrratr

Curse of dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2]

The "curse of dimensionality" depends on the algorithm

The "curse of dimensionality" is not a problem of high-dimensional data alone, but a joint problem of the data and the algorithm being applied. It arises when the algorithm does not scale well to high-dimensional data, typically because it needs an amount of time or memory that is exponential in the number of dimensions of the data. When facing the curse of dimensionality, a good solution can often be found by changing the algorithm, or by pre-processing the data into a lower-dimensional form.
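
One concrete symptom, sketched below under the assumption of uniformly random points: as the dimension grows, the nearest and farthest pairwise distances become nearly equal, which undermines distance-based algorithms such as nearest-neighbor search.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# As dimension d grows, the ratio of the smallest to the largest pairwise
# distance among random points approaches 1: distances "concentrate".
for d in (2, 10, 100, 1000):
    pts = rng.random((200, d))   # 200 uniform random points in [0, 1]^d
    dists = pdist(pts)           # all pairwise Euclidean distances
    print(f"d={d:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```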

GATE Developer

GATE Developer is a development environment that provides a rich set of graphical interactive tools for the creation, measurement and maintenance of software components for processing human language. GATE Developer is open source software, available under the GNU Lesser General Public Licence 3.0, and can be downloaded from this page. (GATE Developer and GATE Embedded are bundled, and in older distributions were referred to simply as "GATE".)

Uses

Language processing software uses specialised data structures and algorithms such as annotation graphs, finite state machines or support vector machines. GATE Developer is a specialist tool similar in purpose and character to a programmer's integrated development environment (which is one reason we call it "the Eclipse of natural language processing"). The user guide has an example use case. Components produced in GATE Developer can be exported to diverse applications via GATE Embedded.
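
To give a flavour of the annotation-style data structures such tools build over text, here is a generic standoff-annotation sketch; it is emphatically not GATE's actual API (GATE Embedded is a Java library), just an invented illustration of annotating text spans with types and features.

```python
from dataclasses import dataclass, field

# Generic standoff annotation: a typed, feature-bearing span over text.
# This is NOT GATE's actual API; it is a toy illustration only.
@dataclass
class Annotation:
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)

text = "GATE was developed at the University of Sheffield."
anns = [
    Annotation(0, 4, "Organization", {"kind": "software"}),
    Annotation(26, 49, "Organization", {"kind": "university"}),
]
for a in anns:
    print(a.type, "->", text[a.start:a.end])
```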

ODIN - The Online Database of Interlinear Text

ODIN, the Online Database of INterlinear text, is a repository of Interlinear Glossed Text (IGT) extracted mainly from scholarly linguistic papers. The repository is both broad-coverage, in that it contains data for a variety of the world's languages (limited only by what data is available and what has been discovered), and rich, in that all data contained in the repository has been subject to linguistic analysis. IGT is a standard method within the field of linguistics for presenting language data, with (1) being a typical example. Common in IGT is a phonetic transcription of the language in question (line 1), a morphosyntactic analysis which includes a morpheme-by-morpheme gloss and grammatical information of varying sorts and granularity (line 2), and a free translation (line 3).

(1) apiya=ya=at              QATAMMA=pat      tapar-ta
    at.that.time=CONJ=3SG.N  in.the.same.way  rule-PAST
    "And at the same time he ruled it in the very same manner."

ODIN is still under construction.
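
A toy sketch of how one might pair the transcription line of an IGT entry with its gloss line; the one-to-one whitespace alignment is an illustrative simplification, and real IGT extraction from papers is considerably messier.

```python
# Example (1) from the text, as a three-line IGT entry.
igt = [
    "apiya=ya=at QATAMMA=pat tapar-ta",
    "at.that.time=CONJ=3SG.N in.the.same.way rule-PAST",
    '"And at the same time he ruled it in the very same manner."',
]

transcription, gloss, translation = igt
# Assume (for illustration) that words align one-to-one on whitespace.
for word, g in zip(transcription.split(), gloss.split()):
    print(f"{word:15s} {g}")
print("free translation:", translation)
```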

Balie - Baseline Information Extraction
