Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. A consequence of this rank lowering is that some dimensions are combined and depend on more than one term.
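
As a rough illustration of the pipeline described above, the sketch below (assuming scikit-learn and a made-up four-document corpus) builds a count matrix and reduces it with truncated SVD. Note that scikit-learn places documents in rows, the transpose of the word-by-paragraph layout in the excerpt, and the choice of two latent concepts is arbitrary for this toy corpus.

# Minimal LSA sketch: count matrix followed by truncated SVD (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
]

# Term counts: rows are documents (scikit-learn convention), columns are unique words.
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD reduces dimensionality while preserving the similarity structure;
# 2 latent "concepts" is an illustrative choice, not a recommendation.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(counts)

print(doc_vectors)  # each document as a 2-dimensional concept vector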

Singular value decomposition

Visualization of the SVD of a two-dimensional, real shearing matrix M: first, the unit disc in blue together with the two canonical unit vectors; then the action of M, which distorts the disc to an ellipse. The SVD decomposes M into three simple transformations: an initial rotation V*, a scaling Σ along the coordinate axes, and a final rotation U. Formally, the singular value decomposition of an m×n real or complex matrix M is a factorization of the form M = UΣV*, where U is an m×m real or complex unitary matrix, Σ is an m×n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and V* (the conjugate transpose of V, or simply the transpose of V if V is real) is an n×n real or complex unitary matrix. The singular value decomposition and the eigendecomposition are closely related. The diagonal entries of Σ are known as the singular values of M.
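
A small numerical check of the factorization M = UΣV*, assuming NumPy; the 4×5 matrix below is an illustrative example and not necessarily the one the truncated excerpt refers to.

# Compute an SVD and verify the reconstruction (assumes NumPy).
import numpy as np

M = np.array([
    [1.0, 0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
])

U, s, Vt = np.linalg.svd(M)           # s holds the singular values in decreasing order
Sigma = np.zeros_like(M)
Sigma[:len(s), :len(s)] = np.diag(s)  # embed them on the diagonal of a 4x5 matrix

print(s)                               # the singular values of M
print(np.allclose(M, U @ Sigma @ Vt))  # True: U Sigma V* reconstructs M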

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a weighting method often used in information retrieval and in particular in text mining. This statistical measure evaluates the importance of a term contained in a document relative to a collection or corpus. The weight increases proportionally with the number of occurrences of the word in the document; it also varies with the frequency of the word in the corpus. The theoretical justification of this weighting scheme rests on the empirical observation of word frequencies in text, which is described by Zipf's law. The term frequency is simply the number of occurrences of the term in the document under consideration (calling it a "frequency" is an abuse of terminology).
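
A minimal sketch of one common tf-idf variant (raw term count multiplied by the logarithm of the inverse document frequency); the corpus and the function name are illustrative, and other weighting variants exist.

# Toy tf-idf computation (one common variant, assumes only the standard library).
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "markets fell sharply today".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                              # raw term frequency in this document
    df = sum(1 for d in corpus if term in d)          # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0], docs))   # appears in 2 of 3 documents: lower weight
print(tf_idf("mat", docs[0], docs))   # appears in only 1 document: higher weight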

Sentiment analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader). A basic task in sentiment analysis[1] is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
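
As an illustration of document-level polarity classification, here is a deliberately naive lexicon-based sketch; the tiny word lists are made-up assumptions rather than a real sentiment lexicon, and practical systems rely on much richer features and models.

# Naive lexicon-based polarity classifier (illustrative word lists only).
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def polarity(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this wonderful phone"))   # positive
print(polarity("The battery life is terrible"))  # negative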

PhD work - Overview

The Symbol Grounding Problem indicates that a subset of a vocabulary must be grounded in the real, physical world in order for the words to have meaning in one's mind. But once words have been grounded in this way, how can they develop into a full vocabulary? Looking at dictionaries which use controlled vocabularies to define all the words within them (all words used in the definitions are drawn from a specified subset of the dictionary) could give some idea as to how new words can effectively be grounded by using a small set of pre-grounded terms. Two controlled-vocabulary dictionaries have been used: the Longman Dictionary of Contemporary English (LDOCE) and the Cambridge International Dictionary of English (CIDE). Using these dictionaries, a number of questions can be answered.

QMSS e-Lessons | About the Chi-Square Test

Generally speaking, the chi-square test is a statistical test used to examine differences with categorical variables. There are a number of features of the social world we characterize through categorical variables: religion, political preference, etc. To examine hypotheses using such variables, use the chi-square test. The chi-square test is used in two similar but distinct circumstances: for estimating how closely an observed distribution matches an expected distribution (we'll refer to this as the goodness-of-fit test), and for estimating whether two random variables are independent. One of the more interesting goodness-of-fit applications of the chi-square test is to examine issues of fairness and cheating in games of chance, such as cards, dice, and roulette. So how can the goodness-of-fit test be used to examine cheating in gambling? One night at the Tunisian Nights Casino, renowned gambler Jeremy Turner (a.k.a.
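
A goodness-of-fit sketch for a dice-fairness question like the casino scenario above, assuming SciPy; the observed counts are invented for illustration.

# Chi-square goodness-of-fit test for a supposedly fair die (assumes SciPy).
from scipy.stats import chisquare

observed = [25, 18, 22, 15, 20, 20]   # counts of each face over 120 rolls (made up)
expected = [sum(observed) / 6] * 6    # 20 rolls per face if the die is fair

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)                  # a small p-value would suggest the die is not fair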

Term-weighting approaches in automatic text retrieval

BibTeX: @INPROCEEDINGS{Salton88term-weightingapproaches, author = {Gerard Salton and Christopher Buckley}, title = {Term-weighting approaches in automatic text retrieval}, booktitle = {Information Processing and Management}, year = {1988}, pages = {513--523}}

Abstract: The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations.

twitrratr

Curse of dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2] The "curse of dimensionality" is not a problem of high-dimensional data in itself, but a joint problem of the data and the algorithm being applied. It arises when the algorithm does not scale well to high-dimensional data, typically because it needs an amount of time or memory that is exponential in the number of dimensions of the data. When facing the curse of dimensionality, a good solution can often be found by changing the algorithm, or by pre-processing the data into a lower-dimensional form.
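
A quick numerical illustration, assuming NumPy, of one such phenomenon: pairwise distances between uniformly random points concentrate around a common value as the number of dimensions grows, which undermines distance-based algorithms.

# Distance concentration in high dimensions (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
for dims in (2, 10, 100, 1000):
    points = rng.random((100, dims))                  # 100 random points in the unit cube
    diffs = points[:, None, :] - points[None, :, :]   # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(100, k=1)]          # keep each pair once
    ratio = dists.std() / dists.mean()                # relative spread of the distances
    print(dims, round(float(ratio), 3))               # shrinks as dims grows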

Document Clustering in Objective-C
