
Natural Language Processing


Deep Linguistic Processing

Truecasing. Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable.
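One simple baseline for the task defined above is to restore, for each word, the casing it most often carries in well-cased text. The sketch below assumes that approach; the function names and the tiny training corpus are hypothetical and chosen only for illustration, and punctuation handling is left out.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the surface forms seen for each lowercased word.

    The first token of each sentence is skipped because its capital
    letter says nothing about the word's true case.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Replace each token with its most frequent casing seen in training."""
    result = []
    for token in text.split():
        forms = counts.get(token.lower())
        # Words never seen in training are left unchanged.
        result.append(forms.most_common(1)[0][0] if forms else token)
    return " ".join(result)


# Hypothetical training data, for illustration only.
corpus = [
    "Yesterday Alice flew to Paris with Bob",
    "We met Alice in Paris",
    "They said Bob likes the city",
]
model = train_truecaser(corpus)
print(truecase("alice and bob visited paris", model))
# prints "Alice and Bob visited Paris"
```

A lookup like this cannot separate words whose casing depends on context, which is one reason practical systems tend to use sequence models instead.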


The need for truecasing commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages). Truecasing aids many other NLP tasks, such as named entity recognition, machine translation and Automatic Content Extraction.[1] Truecasing is unnecessary in languages whose scripts do not distinguish between uppercase and lowercase letters. This covers languages such as Japanese, Chinese, Thai, Hebrew, Arabic and Hindi, which are not written in the Latin, Greek, Cyrillic or Armenian alphabets.

[1] Lita, L.

Query expansion. Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations.[1] In the context of web search engines, query expansion involves evaluating a user's input (the words typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents.
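As a rough illustration of expanding a seed query so that it matches additional documents, the sketch below assumes synonym-based expansion with OR matching. The synonym table, the documents and the function names are all hypothetical; a real engine would likely draw its expansions from a thesaurus, query logs or other sources.

```python
# Hypothetical synonym table; a real system might derive this from a
# thesaurus or from query logs.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the seed query terms plus any known synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(documents, terms):
    """Return documents containing at least one of the terms (OR matching)."""
    wanted = set(terms)
    return [doc for doc in documents if wanted & set(doc.lower().split())]


documents = [
    "cheap car for sale",
    "affordable automobile in good condition",
    "vehicle maintenance tips",
]

seed = "cheap car"
print(search(documents, seed.split()))        # matches only the first document
print(search(documents, expand_query(seed)))  # matches all three documents
```

The expanded query also retrieves the maintenance-tips document, which a user shopping for a car may not want: more documents are found, but not all of them are relevant.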

Query expansion

Query expansion is a methodology studied in computer science, particularly within natural language processing and information retrieval.

Precision and recall tradeoffs

Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. Expanding a query generally increases recall by matching more documents, but it can lower precision when the added terms match irrelevant ones. This tradeoff is one of the defining problems in query expansion: whether expansion is worthwhile given its uncertain effects on precision and recall.

Natural language user interface.

Proofreading.

Professional proofreading

Traditional method

Alternative methods

Copy holding or copy reading employs two readers per proof.


The first reads the text aloud literally as it appears, usually at a comparatively fast but uniform rate of speed. The second reader follows along and marks any pertinent differences between what is read and what was typeset. Experienced copy holders employ various codes and verbal short-cuts that accompany their reading.

Double reading.

Style guides and checklists

Before it is typeset, copy is often marked up by an editor or customer with various instructions as to typefaces, art, and layout. Checklists are commonly employed in proof-rooms where there is sufficient uniformity of product to distill some or all of its components to a list format.

Text & Speech (Natural Language Processing)

Speech Segmentation. Speech Recognition. Speech Synthesis.

Expert Systems (Natural Language Processing)

Question Answering. Information Extraction. Information Retrieval.

Optical character recognition. Optical Character Recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text.

Optical character recognition

It is widely used as a form of data entry from some sort of original paper data source, whether passport documents, invoices, bank statements, receipts, business cards, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining.

OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early versions needed to be programmed with images of each character, and worked on one font at a time.
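The idea of programming a recognizer with an image of each character can be sketched as simple template matching: compare the scanned glyph with every stored bitmap and pick the closest. The 3x5 bitmaps, the pixel-difference measure and the function names below are assumptions made for illustration and are not taken from any particular OCR system.

```python
# Hypothetical 3x5 bitmaps for a single "font"; "#" marks an inked pixel.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the character whose stored image differs least from the glyph."""
    return min(templates, key=lambda ch: pixel_distance(glyph, templates[ch]))


# A scanned "T" with one stray pixel in the fourth row.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
print(recognize(noisy_t))  # prints "T"
```

Because the stored images belong to one typeface, a glyph from a different font would need its own set of templates, which matches the one-font-at-a-time limitation noted above.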

Parsing (Natural Language Processing)

Sentence Boundary Disambiguation. Text Segmentation. Relationship Extraction. Named-Entity Recognition. Part of Speech Tagging.

Stemming. Stemming programs are commonly referred to as stemming algorithms or stemmers.


Examples

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
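A few of the reductions above can be reproduced with a naive suffix-stripping sketch. The suffix list and minimum stem length below are a hypothetical toy rule set, not the rules of any published stemmer.

```python
# Suffixes tried in order, longest first. The list is a hypothetical toy
# rule set, not the rules of any published stemmer.
SUFFIXES = ["ing", "ed", "er", "s"]
MIN_STEM_LENGTH = 3


def stem(word):
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ["fishing", "fished", "fisher", "cats", "stemming"]:
    print(word, "->", stem(word))
```

The first four words reduce to "fish" and "cat" as described, but "stemming" comes out as "stemm": the toy rules do not undo consonant doubling, which fuller rule sets handle with additional steps.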

History

The first published stemmer was written by Julie Beth Lovins in 1968.[1] This paper was remarkable for its early date and had great influence on later work in this area. A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program.

Algorithms

Lookup algorithms

The production technique

Natural language generation. Natural Language Generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form.

Natural language generation

Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations. An NLG system can be thought of as a translator that converts a computer-based representation into a natural language representation. However, the methods to produce the final language are different from those of a compiler due to the inherent expressivity of natural languages. NLG may be viewed as the opposite of natural language understanding: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a concept into words.

Simple examples are systems that generate form letters.
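A form-letter generator of this kind can be sketched as filling a fixed template from a structured record. The template wording, the field names and the sample record below are hypothetical; only `string.Template` from the Python standard library is assumed.

```python
from string import Template

# Hypothetical letter template; the field names are illustrative.
LETTER = Template(
    "Dear $name,\n"
    "Your order $order_id shipped on $ship_date and should arrive within "
    "$days days.\n"
    "Regards,\n"
    "$sender"
)


def generate_letter(record):
    """Fill the template from a structured record (a plain dict)."""
    return LETTER.substitute(record)


record = {
    "name": "Ms. Rivera",
    "order_id": "A-1042",
    "ship_date": "3 March",
    "days": 5,
    "sender": "Customer Service",
}
print(generate_letter(record))
```

Here the only decision about wording is made once, by whoever writes the template; more capable NLG systems make such choices per input.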

Natural Language Understanding

Morphology (linguistics). The discipline that deals specifically with the sound changes occurring within morphemes is morphophonology.

Morphology (linguistics)

The history of morphological analysis dates back to the ancient Indian linguist Pāṇini, who formulated the 3,959 rules of Sanskrit morphology in the text Aṣṭādhyāyī by using a constituency grammar. The Greco-Roman grammatical tradition also engaged in morphological analysis. Studies in Arabic morphology, conducted by Marāḥ al-arwāḥ and Aḥmad b. ‘alī Mas‘ūd, date back to at least 1200 CE.[1] The term "morphology" was coined by August Schleicher in 1859.[2]

Here are examples from other languages of the failure of a single phonological word to coincide with a single morphological word form.

Machine translation. Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another.

Machine translation

On a basic level, MT performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus and statistical techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.[1]

The progress and potential of machine translation have been much debated throughout its history.

Natural language processing. Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
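The limits of the word-for-word substitution described above for machine translation can be seen in a small sketch. The four-entry Spanish-to-English lexicon and the function name are hypothetical and serve only to illustrate the point.

```python
# Hypothetical Spanish-to-English lexicon with one entry per word.
LEXICON = {
    "el": "the",
    "gato": "cat",
    "negro": "black",
    "duerme": "sleeps",
}


def translate_word_for_word(sentence, lexicon=LEXICON):
    """Substitute each source word independently; unknown words pass through."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


print(translate_word_for_word("El gato negro duerme"))
# prints "the cat black sleeps"
```

Every word is translated correctly, yet the result keeps the Spanish noun-adjective order; natural English needs "the black cat sleeps", which requires treating "gato negro" as a phrase.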

Natural language processing

As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

History

The history of NLP generally starts in the 1950s, although work can be found from earlier periods.

In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.

Natural Language.