Language detection with Google's Compact Language Detector

Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it.
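To make the "detect, then offer to translate" idea concrete, here is a minimal sketch in plain Python. It is not CLD and does not use CLD's API or data: it swaps in a toy character-trigram overlap score, and the sample sentences and the names `SAMPLES`, `detect_language`, and `should_offer_translation` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Tiny hand-written training samples (an assumption for this sketch;
# a real detector such as CLD ships far larger statistical tables).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "the language of the page you have visited",
    "fr": "le renard brun saute par dessus le chien paresseux et ceci "
          "est la langue de la page que vous avez visitée",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "dies ist die sprache der seite die sie besucht haben",
}


def trigrams(text):
    """Count overlapping character trigrams in lowercased, padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built once from the samples.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile overlaps most with the text."""
    grams = trigrams(text)

    def score(profile):
        # Unnormalised dot product of the two trigram count vectors.
        return sum(count * profile[gram] for gram, count in grams.items())

    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))


def should_offer_translation(page_text, local_language):
    """Mimic the browser behaviour: offer only if the languages differ."""
    return detect_language(page_text) != local_language


if __name__ == "__main__":
    print(detect_language("la langue de la page"))
    print(should_offer_translation("this is the page you visited", "en"))
```

With samples this small the guess is only trustworthy on text that resembles the training sentences; the point is the shape of the decision, not the accuracy.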


Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

SimSearch 0.2.

Google Labs - Books Ngram Viewer.

Natural Language Toolkit.

Package Index : gensim 0.7.5.

TinySegmenter in Python.

An open source web scraping framework for Python.

Gensim – Python Framework for Vector Space Modelling - gensim documentation.

The Xapian Project.


PyLucene JCC vs GCJ [Archive]

I'm writing because, after spending days with both PyLucene versions (GCJ and JCC), I decided to share the story.


About 5 days ago I didn't even know Lucene existed. I was having many difficulties indexing full-text data in MySQL - it took about 90 hours to index ~64 million records in a single varchar(96) field.

PyLucene @ Open Source Applications Foundation.

Wchartype.