Language detection with Google's Compact Language Detector.
Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if that differs from your local language, offers to translate the page.
Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.
SimSearch 0.2.
Google Labs - Books Ngram Viewer.
Natural Language Toolkit.
Package Index : gensim 0.7.5.
TinySegmenter in Python.
An open source web scraping framework for Python.
Gensim – Python Framework for Vector Space Modelling — gensim documentation.
The Xapian Project.
PyLucene JCC vs GCJ [Archive].
I'm writing because, after spending days with both PyLucene versions (GCJ and JCC), I decided to share the story.
About 5 days ago I didn't even know that Lucene existed. I was having many difficulties indexing full-text data in MySQL: it took about 90 hours to index ~64 million records in a single varchar(96) field.
PyLucene @ Open Source Applications Foundation.
Wchartype.