Apache Tika - Apache Tika

Apache Tika - a content analysis toolkit. The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika. Tika is a project of the Apache Software Foundation, and was formerly a subproject of Apache Lucene.

Related: Concept extraction, Semantic (web)

Language Computer - Cicero On-Demand API. Cicero On-Demand provides a RESTful interface that wraps LCC's CiceroLite and other NLP components. The same API is used whether the server is the one hosted at LCC or is run locally on your machine. You can access a free, rate-limited version online.

Index Microsoft Office Files with Lucene. Christoph Hartmann, January 7th, 2009. In my current research project I faced the challenge of indexing a large number of files. For platform independence, the Java programming language was the first choice. Then I came across the Lucene project. Lucene is an open-source project that “provides Java-based indexing and search technology”. Note that Lucene is a framework library rather than an out-of-the-box application.

The Stanford NLP (Natural Language Processing) Group. Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names.

Dolphin. The goal of “Dolphin” for “Grasshopper” is to support the early stages of the design process using information technology, in order to find architectural solutions as a source of inspiration, as an explicit solution, or as a means to better understand current design problems.

IndexWriterConfig (Lucene 4.6.0 API). Expert: set the interval between indexed terms. Large values cause less memory to be used by an IndexReader, but slow random access to terms. Small values cause more memory to be used by an IndexReader, and speed up random access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed.

For Academics - Sentiment140 - A Twitter Sentiment Analysis Tool. Is the code open source? Unfortunately the code isn't open source. There are a few tutorials with open source code that have similar implementations to ours:
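
Returning to the IndexWriterConfig entry above: the memory-versus-scan trade-off it describes for the term index interval can be illustrated with a small toy model. The sketch below is not Lucene code and does not reflect Lucene's actual data structures; the class `SparseTermIndex` and its method names are hypothetical, chosen only for illustration. It assumes a sorted term list standing in for the on-disk term dictionary, keeps every `interval`-th term in memory, and counts how many other terms are scanned before the target is found.

```python
from bisect import bisect_right


class SparseTermIndex:
    """Toy model of a sorted term dictionary with a sparse in-memory index.

    Hypothetical illustration only; this is not how Lucene is implemented.
    """

    def __init__(self, terms, interval):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        # Sorted, de-duplicated terms stand in for the full term dictionary.
        self.terms = sorted(set(terms))
        self.interval = interval
        # Only every interval-th term is kept "in memory".
        self.indexed = self.terms[::interval]

    def lookup(self, term):
        """Return (position, other_terms_scanned); position is None if absent."""
        # Binary search for the last indexed term that is <= the target.
        slot = bisect_right(self.indexed, term) - 1
        if slot < 0:
            return None, 0
        start = slot * self.interval
        end = min(start + self.interval, len(self.terms))
        scanned = 0
        # Sequential scan of the block that starts at the indexed term.
        for pos in range(start, end):
            if self.terms[pos] == term:
                return pos, scanned
            scanned += 1
        return None, scanned


if __name__ == "__main__":
    terms = [f"t{i:03d}" for i in range(100)]
    for interval in (4, 16):
        index = SparseTermIndex(terms, interval)
        pos, scanned = index.lookup("t031")
        print(f"interval={interval}: {len(index.indexed)} terms in memory, "
              f"{scanned} other terms scanned to reach position {pos}")
```

In this model a larger interval keeps fewer terms in memory but raises the worst-case scan to `interval - 1` other terms per lookup, whatever the number of documents containing the term, which mirrors the behaviour the entry describes.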

UIMA - Standard for unstructured information. UIMA is a component software architecture, developed by IBM, for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework was made available on SourceForge, and later on the website of the Apache Software Foundation. UIMA is also used in medical systems that analyze clinical notes, such as the Clinical Text Analysis and Knowledge Extraction System (CTAKES). Structure of UIMA: The UIMA architecture can be thought of in four dimensions:

Lucene - Index File Formats. This document defines the index file formats used in Lucene version 3.0. If you are using a different version of Lucene, please consult the copy of docs/fileformats.html that was distributed with the version you are using. Apache Lucene is written in Java, but several efforts are underway to write versions of Lucene in other programming languages. If these versions are to remain compatible with Apache Lucene, then a language-independent definition of the Lucene index format is required. This document thus attempts to provide a complete and independent definition of the Apache Lucene 3.0 file formats.