
Brill POS Tagger for Win32 (Paul Maddox)

ECML/PKDD'02 Tutorial on Text Mining and Internet Content Filtering
José María Gómez Hidalgo, Departamento de Inteligencia Artificial, Universidad Europea de Madrid

In recent years, we have witnessed an impressive growth in the availability of information in electronic format, mostly in the form of text, due to the Internet and the increasing number and size of digital and corporate libraries. This overwhelming amount of text is hard for an average human being to consume, producing an information overload problem. Text mining (TM) is an emerging research and development field that addresses the information overload problem by borrowing techniques from data mining, machine learning, information retrieval, natural-language understanding, case-based reasoning, statistics, and knowledge management, to help people gain rapid insight into large quantities of semi-structured or unstructured text. A prototypical application of TM techniques is Internet information filtering.

Outline
The tutorial is divided into two main parts.
1. Marti Hearst and Dunja Mladenic home pages
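To make the filtering application concrete, here is a toy sketch (not taken from the tutorial) of the kind of statistical text classifier that underlies content filters: a Naive Bayes model, trained on a few invented example messages, deciding between "spam" and "ham".

```python
import math
from collections import Counter

# Invented toy training data for illustration only.
TRAIN = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow with the team", "ham"),
]

def train(examples):
    """Collect per-class word counts and class frequencies."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class maximizing log prior + smoothed log likelihood."""
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label, counter in word_counts.items():
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(counter.values())
        for w in text.split():
            # add-one (Laplace) smoothing so unseen words don't zero out
            score += math.log((counter[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, class_counts = train(TRAIN)
print(classify("free money", word_counts, class_counts))  # spam
```

Real filters use far larger vocabularies and training sets, but the decision rule is the same.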
Natural language processing

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
MontyLingua :: a free, commonsense-enriched natural language understander

Recent bugfixes
Version 2.1 (6 Aug 2004): includes the new MontyNLGenerator component, which generates sentences and summaries.
Version 2.0.1: fixes an API bug in version 2.0 which prevented the Java API from being callable.

What is MontyLingua?
MontyLingua is a free*, commonsense-enriched, end-to-end natural language understander for English. Version 2.0 is substantially FASTER, MORE ACCURATE, and MORE RELIABLE than version 1.3.1. MontyLingua differs from other natural language processing tools because:
MontyLingua performs the following tasks over text:
MontyTokenizer: tokenizes raw English text (sensitive to abbreviations) and resolves contractions, e.g.

* free for non-commercial use; please see the MontyLingua Version 2.0 License Terms of Use.

Author: Hugo Liu <hugo@media.mit.edu>
Project Page: <
Documentation
New in version 2.0 (29 Jul 2004)
Download MontyLingua
READ THIS if you are running ML on Mac OS X or Unix
William W. L.
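As an illustration of what a tokenizer of the MontyTokenizer kind has to do (this is a hand-written sketch, not MontyLingua's actual code or API), the following splits punctuation from words while expanding common contractions and leaving abbreviations intact. The contraction and abbreviation tables are tiny, invented samples.

```python
import re

# Hypothetical, partial tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}
ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "Mr."}

def tokenize(text):
    tokens = []
    for word in text.split():
        # expand contractions (case-insensitively, losing case for brevity)
        expanded = CONTRACTIONS.get(word.lower(), word)
        for piece in expanded.split():
            if piece in ABBREVIATIONS:
                tokens.append(piece)  # keep the abbreviation's period
            else:
                # detach trailing punctuation into separate tokens
                m = re.match(r"^(\w+)([.,!?]*)$", piece)
                if m:
                    tokens.append(m.group(1))
                    tokens.extend(m.group(2))
                else:
                    tokens.append(piece)
    return tokens

print(tokenize("Don't stop, Dr. Smith!"))
# ['do', 'not', 'stop', ',', 'Dr.', 'Smith', '!']
```

Note how "Dr." survives intact while the comma and exclamation mark become tokens of their own; a naive whitespace split gets both cases wrong.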
ConceptNet

What is ConceptNet?
ConceptNet is a freely available commonsense knowledgebase and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out of the box (without additional statistical training), including:
- topic-gisting (e.g. a news article containing the concepts "gun," "convenience store," "demand money," and "make getaway" might suggest the topics "robbery" and "crime")
- affect-sensing (e.g. this email is sad and angry)
- analogy-making (e.g. "scissors," "razor," "nail clipper," and "sword" are perhaps like a "knife" because they are all "sharp" and can be used to "cut something")
- text summarization
- contextual expansion
- causal projection
- cold document classification
- and other context-oriented inferences

The ConceptNet knowledgebase is a semantic network presently available in two versions: concise (200,000 assertions) and full (1.6 million assertions).

Papers about ConceptNet
Download ConceptNet
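The topic-gisting example above can be sketched as spreading activation over a semantic network. The handful of assertions below are hand-written stand-ins, not real ConceptNet data, and the voting scheme is a deliberately simplified illustration of the idea: concepts found in a document "vote" for related concepts, and the strongest receivers are proposed as topics.

```python
from collections import Counter, defaultdict

# Toy assertions in ConceptNet's (concept, relation, concept) shape.
ASSERTIONS = [
    ("gun", "UsedFor", "robbery"),
    ("convenience store", "LocationOf", "robbery"),
    ("demand money", "PartOf", "robbery"),
    ("make getaway", "PartOf", "robbery"),
    ("robbery", "IsA", "crime"),
]

# Build an undirected adjacency map, ignoring relation labels for simplicity.
neighbors = defaultdict(set)
for a, _, b in ASSERTIONS:
    neighbors[a].add(b)
    neighbors[b].add(a)

def gist(concepts, depth=2):
    """Spread activation from the document's concepts for `depth` hops;
    return the two most-voted-for concepts as suggested topics."""
    votes = Counter()
    frontier = set(concepts)
    for _ in range(depth):
        nxt = set()
        for c in frontier:
            for n in neighbors[c]:
                if n not in concepts:
                    votes[n] += 1
                    nxt.add(n)
        frontier = nxt
    return [topic for topic, _ in votes.most_common(2)]

print(gist({"gun", "convenience store", "demand money", "make getaway"}))
# ['robbery', 'crime']
```

With four concepts all pointing at "robbery", it collects four votes in the first hop, and "crime" is reached from it in the second, matching the article's example.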
The `Bow' Toolkit

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling, and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow), and document clustering (crossbow). The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students. The name of the library rhymes with `low', not `cow'.

About the library
The library provides facilities for:
- recursively descending directories, finding text files

The library does not:
- have English parsing or part-of-speech tagging facilities

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, Irix, and HP-UX. The code conforms to the GNU coding standards.

Citation
McCallum, Andrew Kachites. Here is a BibTeX entry:

Obtaining the Source
Source code for the library can be downloaded from this directory.
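To show the kind of facility described above (libbow itself is C; this is a Python sketch, not libbow code), the following recursively descends a directory, finds text files, and builds the per-document word statistics that a classifier front-end like rainbow would start from. The demo directory and file are created on the fly.

```python
import os
import tempfile
from collections import Counter

def index_directory(root):
    """Walk `root` recursively and return {path: word Counter}
    for every .txt file found."""
    stats = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    stats[path] = Counter(f.read().lower().split())
    return stats

# Demonstrate on a throwaway directory containing one small document.
root = tempfile.mkdtemp()
with open(os.path.join(root, "doc1.txt"), "w", encoding="utf-8") as f:
    f.write("the cat sat on the mat")

stats = index_directory(root)
print(list(stats.values())[0]["the"])  # 2
```

From counts like these, the statistical models (e.g. the naive Bayes classifier in rainbow) are estimated.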
LDC - Linguistic Data Consortium - Current Projects

Maximum Entropy Modeling Using SharpEntropy

Overview
This article presents a maximum entropy modeling library called SharpEntropy and discusses its usage, first by way of a simple example of predicting outcomes, and secondly by presenting a way of splitting English sentences into constituent tokens (useful for natural language processing). Please note that because most of the code is a conversion based on original Java libraries published under the LGPL license, the source code available for download with this article is also released under the LGPL license. This means it can be used freely in software released under any sort of license, but if you make changes to the library itself and those changes are not for your private use, you must release the source code for those changes. A second article, Statistical parsing of English sentences, shows how SharpEntropy can be used to perform sophisticated natural language processing tasks.

Introduction
SharpEntropy is a C# port of the MaxEnt toolkit available from SourceForge.
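To give a feel for the "predicting outcomes" use case, here is a minimal Python sketch of maximum entropy modeling in its two-outcome special case (logistic regression), trained by plain gradient ascent. The feature names and events are invented for illustration; this is not SharpEntropy's API, and SharpEntropy itself uses GIS-style training rather than this simple loop.

```python
import math

# Invented training events: (active features, observed outcome).
EVENTS = [
    ({"sunny"}, "dry"),
    ({"sunny", "windy"}, "dry"),
    ({"rainy"}, "wet"),
    ({"rainy", "windy"}, "wet"),
]
FEATURES = sorted({f for feats, _ in EVENTS for f in feats})
w = {f: 0.0 for f in FEATURES}  # one weight per feature, for outcome "wet"

def p_wet(feats):
    """Model probability of outcome 'wet' given the active features."""
    z = sum(w[f] for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

# Gradient ascent on the (concave) log-likelihood.
for _ in range(500):
    grad = {f: 0.0 for f in FEATURES}
    for feats, outcome in EVENTS:
        error = (1.0 if outcome == "wet" else 0.0) - p_wet(feats)
        for f in feats:
            grad[f] += error
    for f in FEATURES:
        w[f] += 0.5 * grad[f]

print(round(p_wet({"rainy"}), 3), round(p_wet({"sunny"}), 3))
```

After training, "rainy" predicts "wet" with probability near 1 and "sunny" predicts it with probability near 0, while "windy", seen with both outcomes, contributes little either way.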