Words&text

Language detection with Google's Compact Language Detector. Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it. Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar. It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new, separate Google Code project, making it possible to use CLD directly from any C++ code.

I also added a basic initial Python binding (one method!), so detecting language is now very simple from Python:

    import cld
    topLanguageName = cld.detect(bytes)[0]

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand. Thank you Google!

SimSearch 0.2. Google Labs - Books Ngram Viewer. Natural Language Toolkit. Package Index : gensim 0.7.5. TinySegmenter in Python. Scrapy | An open source web scraping framework for Python. Gensim – Python Framework for Vector Space Modelling — gensim documentation. The Xapian Project.
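
The CLD excerpt above says the input must already be clean UTF-8 before cld.detect is called. A minimal sketch of one way to do that cleanup is below. The helper names to_clean_utf8 and detect_top_language, the latin-1 fallback, and the choice to strip control characters are all assumptions made for illustration; they only approximate "interchange-valid" and are not part of the cld binding. The only thing taken from the excerpt is that cld.detect(bytes)[0] is the top language name.

```python
import unicodedata


def to_clean_utf8(raw, fallback_encoding="latin-1"):
    """Return raw re-encoded as UTF-8 with control characters removed.

    Hypothetical helper: an approximation of "clean" input, not CLD's
    own definition of interchange-valid UTF-8.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 are in the fallback encoding.
        text = raw.decode(fallback_encoding)
    # Keep tab, newline and carriage return; drop every other control character.
    kept = [
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    ]
    return "".join(kept).encode("utf-8")


def detect_top_language(raw, detect=None):
    """Clean raw bytes, then return the first item of the detector's result.

    detect defaults to cld.detect from the binding described in the excerpt
    (it must be installed); any callable taking bytes can be passed instead.
    """
    if detect is None:
        import cld  # imported lazily so the cleanup helper works without it
        detect = cld.detect
    return detect(to_clean_utf8(raw))[0]
```

Passing the detector in as a parameter keeps the cleanup step testable on a machine where the cld module is not installed, for example detect_top_language(data, detect=my_stub).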

Machinelearning

PyLucene JCC vs GCJ [Archive] I'm writing because, after spending days with both PyLucene versions (GCJ and JCC), I decided to share the story. About 5 days ago I didn't even know Lucene existed. I was having many difficulties indexing full-text data in MySQL: it took about 90 hours to index ~64 million records in a single varchar(96) field. And when I thought it was finishing, the development computer rebooted.

Surfing the web and talking with friends, I looked for something better (and faster) to get my data indexed in a smart way, making it easy to search. Well, I chose PyLucene because much of my work was already written in Python, and also because I prefer Python over Java. PyLucene-2.2.0-1.tar.gz is the GCJ version of PyLucene, and it's pretty simple to install:

    sudo cp -a PyLucene-2.2.0-1/python/* /usr/local/lib/python2.5/site-packages/

You also need to copy some libraries required by GCJ: That's it. JCC is ready.

PyLucene @ Open Source Applications Foundation. GITS :: Code :: wchartype.