Apache UIMA

UIMA enables applications to be decomposed into components, for example "language identification" => "language-specific segmentation" => "sentence boundary detection" => "entity detection (person/place names, etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes. Apache UIMA is an Apache-licensed open-source implementation of the UIMA specification, which is in turn being developed concurrently by a technical committee within OASIS, a standards organization.
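The component-pipeline idea described above can be illustrated with a minimal sketch. This is plain Python, not the actual UIMA API; the `Annotator` base class, the component names, and the shared-dictionary document state are all hypothetical stand-ins for UIMA's annotators and CAS:

```python
# Conceptual sketch of a UIMA-style analysis pipeline: each component
# reads the shared document state and enriches it with new annotations.
# Not the UIMA framework API -- an illustration of the decomposition only.

class Annotator:
    """A pipeline component that enriches a shared document state."""
    def process(self, doc):
        raise NotImplementedError

class LanguageIdentifier(Annotator):
    def process(self, doc):
        doc["language"] = "en"  # stub: a real component would detect the language

class SentenceSplitter(Annotator):
    def process(self, doc):
        doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]

class Pipeline:
    """The framework role: manages components and the data flow between them."""
    def __init__(self, components):
        self.components = components
    def run(self, text):
        doc = {"text": text}
        for component in self.components:
            component.process(doc)
        return doc

doc = Pipeline([LanguageIdentifier(), SentenceSplitter()]).run(
    "UIMA is a framework. It manages components.")
print(doc["sentences"])  # → ['UIMA is a framework', 'It manages components']
```

Each stage only depends on the interface contract, which is what lets UIMA swap, replicate, or remote individual components without changing the rest of the pipeline.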

maui-indexer - Maui - Multi-purpose automatic topic indexing

Maui automatically identifies the main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms, or titles of Wikipedia articles. Maui performs the following tasks: term assignment with a controlled vocabulary (or thesaurus), subject indexing, topic indexing with terms from Wikipedia, keyphrase extraction, terminology extraction, and automatic tagging. It can also be used for semi-automatic topic indexing.

Alexandre Bouchard-Côté

Email: bouchard AT stat.ubc.ca. Assistant Professor in the Department of Statistics at UBC. Path: McGill -> UCB -> UBC. AKA: Alex, Bouchard, or 卜利森. Office: ESB, Room 3124. Resumé (last updated: Nov. '13). Research interests: my main field of research is statistical machine learning.
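As a rough illustration of the simplest of Maui's tasks, keyphrase extraction can be approximated by frequency scoring of candidate terms. This is a naive sketch for intuition only; Maui itself uses a trained, supervised model, and the stopword list here is an arbitrary assumption:

```python
from collections import Counter

# Naive keyphrase extraction: rank non-stopword tokens by frequency.
# Maui's real approach is supervised learning over richer features;
# this only illustrates what the keyphrase-extraction task produces.
STOPWORDS = {"the", "a", "of", "in", "and", "is", "to", "for", "or", "may", "be"}

def keyphrases(text, k=3):
    tokens = [t.strip(".,").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

phrases = keyphrases(
    "Topic indexing assigns topics to documents. Topics may be tags or index terms.")
print(phrases)  # most frequent candidate terms first
```

A real indexer additionally matches candidates against a controlled vocabulary and learns which candidates human indexers actually choose.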

LanguageWare Resource Workbench

Update, July 20, 2012: Studio 3.0 is out, and it is officially bundled with ICA 3.0. If you are a Studio 3.0 user, please use the ICA forum instead of the LRW forum. LRW is a fixpack that resolves issues in various areas, including the Parsing Rules editor, PEAR file export, and Japanese/Chinese language support. LRW is still available for download via the Downloads link for IBM OmniFind Enterprise Edition V9.1 Fix Pack users.

What is IBM LanguageWare? IBM LanguageWare is a technology that provides a full range of text-analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions that focus on mining facts from large repositories of text. LanguageWare is well suited to extracting the value locked up in unstructured text and exposing it to business applications.

Lucene - Apache Lucene Core

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open-source project available for free download. Lucene offers powerful features through a simple API.
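Full-text search engines like Lucene are built around an inverted index: a mapping from each term to the documents that contain it. The following toy sketch illustrates that core data structure only; it is not Lucene's API, and real engines add text analysis, relevance scoring, and on-disk index segments:

```python
from collections import defaultdict

# Toy inverted index: maps each term to the set of document ids containing it.
# Illustrates the data structure at the heart of full-text search engines;
# Lucene itself adds analyzers, scoring, and persistent index segments.

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result

idx = TinyIndex()
idx.add(1, "full text search engine")
idx.add(2, "text analytics library")
print(sorted(idx.search("text search")))  # → [1]
```

Because lookups go term-by-term rather than document-by-document, query time scales with the length of the posting lists involved, not with the total size of the collection.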

Kea

Kea takes a directory name and processes all documents in that directory that have the extension ".txt". The default language and encoding is English, but this can be changed as long as a corresponding stopword file and a stemmer are provided.

ELKI

ELKI is a university project developed for use in teaching and research. The source code is written with extensibility, readability, and reusability in mind, but it is not extensively optimized for performance.

Triplify — Agile Knowledge Management and Semantic Web (AKSW)

More than 20 European Union Datasets Converted to RDF by LATC Project: over the past two years, the LATC project (Linked Open Data Around-The-Clock) has worked on converting more than 20 EU datasets to RDF, making them available as Linked Data and via SPARQL, and linking them to other datasets. The datasets have gone through internal quality assurance against a publication checklist.
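Kea's input convention (process every ".txt" file in a given directory, filtering tokens against a stopword list) can be sketched as follows. This is a generic illustration of the convention, not Kea's actual Java implementation, and the function names are hypothetical:

```python
import os

# Sketch of Kea-style input handling: read every ".txt" document in a
# directory and tokenize it against a stopword list. Kea itself is a Java
# tool with a trained keyphrase model; this mirrors only its input rules.

def load_stopwords(path):
    """One stopword per line; swapping this file is how Kea changes language."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def process_directory(directory, stopwords):
    results = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue  # Kea only picks up files with the ".txt" extension
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            tokens = f.read().lower().split()
        results[name] = [t for t in tokens if t not in stopwords]
    return results
```

Supplying a different stopword file (and stemmer) is what the description above means by changing the default language.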

NERD: Named Entity Recognition and Disambiguation

This version: 2012-11-07 - v0.5 [n3]. History: 2011-10-04 - v0.4 [n3]; 2011-08-31 - v0.3 [n3]; 2011-08-01 - v0.2 [n3].

Apache Spark

Apache Spark is an open-source[1] data-analytics cluster-computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).[2] However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.[3] Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.[4] Spark became an Apache Top-Level Project in February 2014,[5] having previously been an Apache Incubator project since June 2013.[6] It has received code contributions from large companies that use Spark, including Yahoo!. Spark offers Java, Scala, and Python APIs, with proven scalability to 100 nodes in the research lab[14] and 80 nodes in production at Yahoo!.
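Spark's "load once, query repeatedly" primitive can be sketched with a toy dataset abstraction: transformations are lazy thunks, and caching materializes a result in memory so later queries avoid recomputation. This is plain Python imitating the RDD idea for illustration; the class and method names only mimic Spark's API:

```python
# Conceptual sketch of Spark's core idea: build a lazy chain of
# transformations over a dataset, then cache the materialized result
# in memory so repeated queries avoid recomputation. Not the Spark API.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # thunk: how to (re)build this dataset
        self._cached = None

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()   # materialize once, keep in memory
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

    def count(self):
        return len(self.collect())

data = ToyRDD(lambda: list(range(10)))
evens = data.filter(lambda x: x % 2 == 0).cache()   # computed once, reused below
print(evens.count(), evens.map(lambda x: x * x).collect())
# → 5 [0, 4, 16, 36, 64]
```

Iterative machine-learning algorithms benefit from exactly this pattern: the training set is cached once, then scanned on every iteration without being reloaded from disk.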

FUSION Semantic Registry

UDDI-based Web service registries are included as a standard offering within the product suites of all major SOA vendors, serving as the foundation for establishing design-time and run-time SOA governance. Despite the success of the UDDI specification and its rapid uptake by industry, the service discovery facilities it offers are rather limited. The lack of machine-understandable semantics in the technical specifications and classification schemes used for retrieving services prevents UDDI registries from supporting fully automated, and thus truly effective, service discovery. The FUSION Semantic Registry is a semantically enhanced service registry that builds on the UDDI specification and augments its service publication and discovery facilities to overcome these limitations. Kourtesis, D. and Paraskakis, I. Supporting Semantically Enhanced Web Service Discovery for Enterprise Application Integration.

The Stanford NLP (Natural Language Processing) Group

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.
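What "labeling sequences of words" produces can be illustrated with a toy gazetteer-based tagger. This is a naive sketch with a made-up word list; Stanford NER actually uses trained CRF sequence models rather than dictionary lookup:

```python
# Toy named-entity tagger using a fixed gazetteer lookup.
# Stanford NER uses trained CRF sequence models; this only illustrates
# the task's output format: a label for every token, "O" for non-entities.

GAZETTEER = {
    "stanford": "ORGANIZATION",
    "california": "LOCATION",
    "alice": "PERSON",
}

def tag(tokens):
    """Return (token, label) pairs; 'O' marks tokens outside any entity."""
    return [(t, GAZETTEER.get(t.lower(), "O")) for t in tokens]

print(tag(["Alice", "studies", "at", "Stanford"]))
# → [('Alice', 'PERSON'), ('studies', 'O'), ('at', 'O'), ('Stanford', 'ORGANIZATION')]
```

A sequence model improves on this by using context (neighboring words and labels), which is how it distinguishes, say, "Stanford" the university from "Stanford" a surname.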