
Natural language processing
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation. The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.
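The hand-written-rule approach is easy to caricature in a few lines. Below is a minimal sketch in Python, with patterns and labels invented purely for illustration (it does not correspond to any historical system):

```python
import re

# A few invented hand-written rules: each maps a regex pattern
# to the "meaning" label the system should assign to matching input.
RULES = [
    (re.compile(r"\bwhat time\b", re.I), "QUERY_TIME"),
    (re.compile(r"\bweather in (\w+)\b", re.I), "QUERY_WEATHER"),
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "GREETING"),
]

def interpret(utterance: str) -> str:
    """Return the first rule label whose pattern matches, else UNKNOWN."""
    for pattern, label in RULES:
        if pattern.search(utterance):
            return label
    return "UNKNOWN"

print(interpret("Hello there"))                 # GREETING
print(interpret("What's the weather in Oslo"))  # QUERY_WEATHER
```

Systems of this kind grow brittle quickly, which is part of why statistical and machine-learning methods displaced them.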

WordNet WordNet is a lexical database for the English language.[1] It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD-style license and can be downloaded and used freely. The database can also be browsed online. WordNet was created at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George Armitage Miller. An example entry is "hamburger"; a sample adjective synset reads: good, right, ripe – (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes")
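The database can also be queried programmatically. One minimal sketch uses NLTK's WordNet reader, assuming the nltk package is installed and the corpus has been fetched with nltk.download('wordnet'):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Look up the synsets (synonym sets) that contain "hamburger".
for synset in wn.synsets("hamburger"):
    print(synset.name())              # synset identifier, e.g. 'hamburger.n.01'
    print("  ", synset.definition())  # the short, general definition
    print("  ", synset.lemma_names()) # the synonyms grouped into this synset
    # Semantic relations are recorded between synsets, e.g. hypernyms:
    print("  ", synset.hypernyms())
```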

Machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT), or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. On a basic level, MT performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus and statistical techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.[1] The progress and potential of machine translation have been much debated throughout its history.
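A minimal sketch of that naive word-for-word substitution, with an invented toy Spanish-to-English dictionary, makes the limitation concrete:

```python
# Invented toy Spanish->English dictionary, for illustration only.
TOY_DICT = {
    "el": "the", "gato": "cat", "negro": "black",
    "come": "eats", "pescado": "fish",
}

def word_for_word(sentence: str) -> str:
    """Substitute each source word independently; unknown words pass through."""
    return " ".join(TOY_DICT.get(w, w) for w in sentence.lower().split())

print(word_for_word("el gato negro come pescado"))
# -> "the cat black eats fish": every word is translated correctly, but the
# Spanish noun-adjective order survives, so the output is not good English.
```

This is exactly the failure the paragraph describes: without recognizing whole phrases and their counterparts in the target language, per-word substitution cannot produce fluent output.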

Morphology (linguistics) The discipline that deals specifically with the sound changes occurring within morphemes is morphophonology. The history of morphological analysis dates back to the ancient Indian linguist Pāṇini, who formulated the 3,959 rules of Sanskrit morphology in the text Aṣṭādhyāyī by using a constituency grammar. The Greco-Roman grammatical tradition also engaged in morphological analysis. Studies in Arabic morphology, such as the Marāḥ al-arwāḥ of Aḥmad b. ‘Alī Mas‘ūd, date back to at least 1200 CE.[1] The term "morphology" was coined by August Schleicher in 1859.[2] Here is an example from another language of the failure of a single phonological word to coincide with a single morphological word form: kwixʔid-i-da bəgwanəma-χ-a q'asa-s-is t'alwagwayu. Morpheme-by-morpheme translation: kwixʔid-i-da = clubbed-PIVOT-DETERMINER; bəgwanəma-χ-a = man-ACCUSATIVE-DETERMINER; q'asa-s-is = otter-INSTRUMENTAL-3SG-POSSESSIVE; t'alwagwayu = club. "The man clubbed the otter with his club."
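Once the morpheme boundaries are known, producing an interlinear gloss like the one above is mechanical table lookup. A small sketch, with the gloss table transcribed from the example:

```python
# Morpheme-to-gloss table transcribed from the Kwak'wala example above.
GLOSSES = {
    "kwixʔid": "clubbed", "i": "PIVOT", "da": "DETERMINER",
    "bəgwanəma": "man", "χ": "ACCUSATIVE", "a": "DETERMINER",
    "q'asa": "otter", "s": "INSTRUMENTAL", "is": "3SG-POSSESSIVE",
    "t'alwagwayu": "club",
}

def gloss(word: str) -> str:
    """Gloss one morphologically segmented word, morpheme by morpheme."""
    return "-".join(GLOSSES.get(m, "?") for m in word.split("-"))

for w in ["kwixʔid-i-da", "bəgwanəma-χ-a", "q'asa-s-is", "t'alwagwayu"]:
    print(w, "=", gloss(w))
```

Note that the genuinely hard part of morphological analysis, segmenting the surface word into morphemes in the first place, is assumed already done here.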

Automatic summarization Methods of automatic summarization include extraction-based, abstraction-based, maximum entropy-based, and aided summarization. Two particular types of summarization often addressed in the literature are keyphrase extraction, where the goal is to select individual words or phrases to "tag" a document, and document summarization, where the goal is to select whole sentences to create a short paragraph summary. Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences, or paragraphs), while abstraction involves paraphrasing sections of the source document. While some work has been done in abstractive summarization (creating an abstract synopsis like that of a human), the majority of summarization systems are extractive (selecting a subset of sentences to place in a summary).
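A minimal sketch of the extractive idea, scoring sentences by the frequency of their words and keeping the top-scoring ones (a crude heuristic for illustration, not any particular published system):

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Score each sentence by summed word frequencies; keep the top n."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> int:
        # A sentence scores the sum of its words' document frequencies.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Re-emit the selected sentences in their original document order.
    return " ".join(s for s in sentences if s in top)
```

Note how purely extractive this is: the output is a verbatim subset of the input sentences, with no paraphrasing of the kind abstraction would require.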

Natural language generation Natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations. It could be said that an NLG system is like a translator that converts a computer-based representation into a natural language representation. However, the methods to produce the final language are different from those of a compiler, due to the inherent expressivity of natural languages. NLG may be viewed as the opposite of natural language understanding: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a concept into words. Simple examples are systems that generate form letters.
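The form-letter case is easy to sketch as template filling; the record fields and wording below are invented for the example. Anything more sophisticated must also choose words, aggregate facts, and plan sentence structure, which a fixed template cannot do:

```python
# Invented machine representation: a record describing one account.
record = {"name": "A. Reader", "balance": 42.17, "due": "2024-07-01"}

# Template-based generation: the simplest kind of NLG system.
TEMPLATE = (
    "Dear {name},\n"
    "Your account balance of ${balance:.2f} is due on {due}.\n"
    "Sincerely, The Billing Department"
)

print(TEMPLATE.format(**record))
```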

Optical character recognition Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. It is widely used as a form of data entry from some sort of original paper data source, whether passport documents, invoices, bank statements, receipts, business cards, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data extraction, and text mining. Early versions needed to be programmed with images of each character, and worked on one font at a time.
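In practice, OCR is usually done with an existing engine rather than written from scratch. One common open-source route (an assumption for this sketch, not something the excerpt prescribes) is the Tesseract engine via the pytesseract wrapper, which requires both the Python package and the Tesseract binary to be installed; the file name below is hypothetical:

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract; also needs the Tesseract binary

# Convert a scanned page (hypothetical file) into machine-readable text.
image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```

The engine internally performs the pre-processing, character recognition, and post-processing steps the excerpt names, so the calling code stays this short.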
