Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.
Information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human-language texts by means of natural language processing (NLP). Recent work in multimedia document processing, such as automatic annotation and content extraction from images, audio and video, can also be seen as information extraction. Because the problem is difficult, current approaches to IE focus on narrowly restricted domains.
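To make the idea of a narrowly restricted domain concrete, the following is a minimal sketch of pattern-based extraction. The sentence template ("<person> joined <company> in <year>"), the function name `extract_hires` and the sample text are all hypothetical; practical IE systems rely on much richer NLP than a single regular expression.

```python
import re

# Hypothetical template for one narrow domain:
# "<First Last> joined <Company> in <year>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<company>[A-Z][A-Za-z]+) in (?P<year>\d{4})"
)

def extract_hires(text):
    """Turn free text into structured records, one dict per pattern match."""
    return [
        {
            "person": m.group("person"),
            "company": m.group("company"),
            "year": int(m.group("year")),
        }
        for m in PATTERN.finditer(text)
    ]

# Invented sample text, used only to exercise the pattern.
sample = "Ada Lovelace joined Acme in 1843. Later, Alan Turing joined Initech in 1936."
records = extract_hires(sample)
for record in records:
    print(record)
```

The pattern only recognizes sentences that follow the template exactly, which illustrates why such approaches tend to stay within a restricted domain.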
Terminology extraction Terminology mining, term extraction, term recognition, or glossary extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus. In the semantic web era, a growing number of communities and networked enterprises have started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, such as topic-driven web crawlers [1], web services [2] and recommender systems [3].
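One simple way to propose candidate terms is to count word sequences that occur between stopwords and punctuation. The sketch below assumes that approach; the stopword list, the `candidate_terms` function and the two-document corpus are illustrative choices, not a method described in the text above.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on", "with", "by"}

def candidate_terms(corpus, max_len=2):
    """Count word n-grams (up to max_len words) that contain no stopword
    and do not cross punctuation."""
    counts = Counter()
    for doc in corpus:
        # Keep words and sentence punctuation; punctuation acts as a boundary.
        tokens = re.findall(r"[a-z]+|[.,;:!?]", doc.lower())
        run = []
        for tok in tokens + ["."]:  # trailing boundary flushes the last run
            if tok.isalpha() and tok not in STOPWORDS:
                run.append(tok)
                continue
            # Boundary reached: count every n-gram inside the current run.
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[" ".join(run[i:i + n])] += 1
            run = []
    return counts

# Invented corpus for demonstration.
corpus = [
    "Web crawlers index the web.",
    "Topic-driven web crawlers follow links.",
]
terms = candidate_terms(corpus)
print(terms.most_common(3))
```

Raw frequency is only a first filter; a real extractor would usually add linguistic patterns or a statistical relevance measure on top of it.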
Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E.
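As a rough illustration of the stochastic group, the sketch below trains a most-frequent-tag (unigram) baseline on a tiny hand-made corpus. The training data, the tag names and the default tag for unknown words are assumptions made for this example, and this baseline ignores context entirely, which real taggers do not.

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_unigram(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_unigram(TRAINING)
tagged = tag_words(["The", "cat", "barks"], model)
print(tagged)
```

Because the model looks at one word at a time, an ambiguous word always receives the same tag; context-sensitive methods exist precisely to resolve such cases.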
Constraint Grammar Constraint Grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags.
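The rule mechanism can be approximated in a few lines of Python. This is a loose sketch under stated assumptions, not CG rule syntax or a CG compiler: the helper `remove_if_prev`, the tag names and the example sentence are hypothetical, only the "remove" operation with a local (distance -1) context is modelled, and the sketch chooses never to remove a token's last remaining reading.

```python
def remove_if_prev(target, prev_tag):
    """Build a rule: remove reading `target` from a token when the previous
    token is unambiguously tagged `prev_tag`."""
    def rule(sentence, i):
        if i == 0:
            return  # no previous token, so the context cannot match
        _, readings = sentence[i]
        _, prev_readings = sentence[i - 1]
        # Design choice of this sketch: keep at least one reading per token.
        if prev_readings == {prev_tag} and target in readings and len(readings) > 1:
            readings.discard(target)
    return rule

def apply_rules(sentence, rules):
    """Apply each rule to every token position, in order (modifies in place)."""
    for rule in rules:
        for i in range(len(sentence)):
            rule(sentence, i)
    return sentence

# Each token starts with all of its possible readings (invented example).
sentence = [
    ("the", {"DET"}),
    ("walk", {"NOUN", "VERB"}),
    ("ends", {"NOUN", "VERB"}),
]
rules = [
    remove_if_prev("VERB", "DET"),   # no verb reading directly after a determiner
    remove_if_prev("NOUN", "NOUN"),  # toy rule: no noun reading directly after a noun
]
result = apply_rules(sentence, rules)
print(result)
```

The second rule can only fire after the first has disambiguated "walk", which shows how rules build on each other's results.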
Treebank A treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name treebank. The term parsed corpus is often used interchangeably with treebank, with the emphasis on the primacy of sentences rather than trees. Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.
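To show how a parsed sentence sits on top of part-of-speech tags, the sketch below reads one tree written in a bracketed notation and recovers the tagged words from its leaves. The bracketed format, the labels (S, NP, VP, DT, NN, VBZ) and the function names are assumptions for illustration; the text above does not prescribe any particular storage format.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP (DT the)))" into nested
    (label, children) tuples; leaf words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                # skip ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()

def tagged_leaves(tree):
    """Collect (word, tag) pairs from the nodes that directly dominate a word."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_leaves(child))
    return pairs

# One invented sentence in the assumed bracketed notation.
tree = parse_tree("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
leaves = tagged_leaves(tree)
print(leaves)
```

The nodes immediately above the words carry the part-of-speech layer, while the higher nodes (NP, VP, S) add the syntactic structure that makes the corpus a treebank.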