
Words&text
Changing Bits: Language detection with Google's Compact Language Detector
Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it. Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OS X and Linux. News: NLTK development has moved to GitHub [October 2011]; Version 2.0.1rc1 released [April 2011]; NLTK Cookbook by Jacob Perkins [December 2010]; NLTK book in third printing [November 2010]; Japanese translation of NLTK book published [November 2010]. Courses: ~100 courses in 23 countries using NLTK (artificial intelligence, computational linguistics, information retrieval, machine learning).
Natural Language Toolkit
Gensim is a Python library for Vector Space Modelling with very large corpora. Its target audience is the Natural Language Processing (NLP) community. It offers efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections. If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia. This software depends on NumPy and SciPy, two Python packages for scientific computing.
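For readers new to the Vector Space Model mentioned above, here is a minimal sketch in plain Python. It is not gensim's implementation and uses none of its API: the helper names `bag_of_words` and `cosine_similarity` and the three sample documents are made up for illustration. Each text becomes a sparse term-count vector, and a query is ranked against the documents by cosine similarity.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse term-frequency vector (token -> count).

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative toy corpus (an assumption, not data from any library).
documents = [
    "human machine interface for lab computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors a survey",
]
vectors = [bag_of_words(d) for d in documents]

# Score a query against every document, then rank best match first.
query = bag_of_words("computer survey")
sims = [cosine_similarity(query, v) for v in vectors]
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)

for doc_id, score in ranked:
    print(doc_id, round(score, 3), documents[doc_id])
```

Raw term counts are the simplest possible weighting. Libraries such as gensim typically add a weighting or dimensionality-reduction step on top of vectors like these, which this sketch leaves out.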
Package Index : gensim 0.7.5
TinySegmenter in Python
Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core.
Scrapy | An open source web scraping framework for Python
Gensim – Python Framework for Vector Space Modelling — gensim documentation
    >>> from gensim import corpora, models, similarities
    >>>
    >>> # Load corpus iterator from a Matrix Market file on disk.
    >>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
    >>>
    >>> # Initialize a transformation (Latent Semantic Indexing with 200 latent dimensions).
    >>> lsi = models.LsiModel(corpus, num_topics=200)
    >>>
    >>> # Convert another corpus to the latent space and index it.
    >>> index = similarities.MatrixSimilarity(lsi[another_corpus])
    >>>
    >>> # Determine similarity of a query document against each document in the index.
    >>> sims = index[query]

Welcome to the Xapian project website. Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, Ruby and Lua (so far!).
The Xapian Project
machinelearning

