Changing Bits: Language detection with Google's Compact Language Detector. Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it. Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.
Python framework for fast Vector Space Modelling. Package Documentation. Latest Version: 0.8.6. Gensim is a Python library for Vector Space Modelling with very large corpora. Target audience is the Natural Language Processing (NLP) community. Features: all algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM); intuitive interfaces that are easy to plug in your own input corpus/datastream (trivial streaming API) and easy to extend with other Vector Space algorithms (trivial transformation API); efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections; distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers. Package Index: gensim 0.7.5
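The "trivial streaming API" in the excerpt above comes down to a corpus being any iterable that yields one sparse document vector at a time, so the whole collection never has to sit in RAM. Here is a minimal sketch in plain Python, without importing gensim. The class name StreamingCorpus, the whitespace tokenizer, the hand-written vocabulary and the two sample documents are all illustrative assumptions, not gensim's own code.

```python
from collections import Counter


class StreamingCorpus(object):
    """Hypothetical corpus: yields one bag-of-words vector per document."""

    def __init__(self, lines, vocabulary):
        self.lines = lines            # any iterable of text lines, e.g. an open file
        self.vocabulary = vocabulary  # dict mapping token -> integer id (assumed given)

    def __iter__(self):
        for line in self.lines:
            # Count only tokens that appear in the vocabulary.
            counts = Counter(
                self.vocabulary[token]
                for token in line.lower().split()
                if token in self.vocabulary
            )
            # Sparse vector: sorted list of (token_id, count) pairs.
            yield sorted(counts.items())


# Tiny in-memory stand-in for a large file on disk.
documents = ["human computer interaction", "computer system computer"]
vocabulary = {"human": 0, "computer": 1, "interaction": 2, "system": 3}
vectors = list(StreamingCorpus(documents, vocabulary))
```

Because documents are produced lazily inside `__iter__`, swapping the list for an open file handle would process the input line by line; only the `list(...)` call here materializes everything, and only for demonstration.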
Gensim – Python Framework for Vector Space Modelling — gensim documentation
The Xapian Project
PyLucene JCC vs GCJ [Archive]. I'm writing because, after spending days with PyLucene versions (GCJ and JCC), I decided to share the story. About 5 days ago I didn't even know about the existence of Lucene. I was having many difficulties indexing full-text data on MySQL: it took about 90 hours to index ~64 million records in a single varchar(96) field.
PyLucene @ Open Source Applications Foundation
GITS :: Code :: wchartype. wchartype is a Python module for getting the types of double-byte (full-width) characters. It has no external dependencies. wchartype is licensed under the MIT license. Usage:

    import wchartype
    if wchartype.is_asian(u'\u65e5'):
        print(u"\u65e5 is an Asian character")
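For comparison, a similar full-width check can be sketched with only the standard library's `unicodedata` module. This is a swapped-in technique based on the Unicode East Asian Width property, not wchartype's implementation, and the helper name `is_wide` is hypothetical.

```python
import unicodedata


def is_wide(char):
    """Return True if `char` has East Asian Width 'W' (Wide) or 'F' (Fullwidth)."""
    return unicodedata.east_asian_width(char) in ("W", "F")


wide_kanji = is_wide(u"\u65e5")    # CJK ideograph
wide_letter = is_wide(u"\uff21")   # fullwidth Latin capital letter A
wide_ascii = is_wide(u"A")         # ordinary ASCII letter
```

This only answers "does the character occupy a full-width cell"; it does not distinguish scripts such as kana, hangul or hanzi.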