Words&text


Language detection with Google's Compact Language Detector. Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it.

Language detection with Google's Compact Language Detector

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar. It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new, separate Google Code project, making it possible to use CLD directly from any C++ code. I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).
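To give a feel for what a language detector does, here is a toy sketch that compares character trigrams against tiny per-language profiles. It is not CLD's algorithm and does not use the CLD binding; the sample sentences, the `trigrams` helper, and the `detect` function are all made up for illustration, and a real detector is trained on far more text.

```python
from collections import Counter

# Tiny, made-up training samples (assumption: illustration only).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze",
}


def trigrams(text):
    """Return a Counter of overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}


def detect(text):
    """Return (language, score); score is the share of trigrams found in that profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return None, 0.0
    scores = {
        lang: sum(n for g, n in grams.items() if g in profile) / total
        for lang, profile in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With profiles this small the scores are only meaningful for text that resembles the samples; very short inputs yield no trigrams and return `(None, 0.0)`.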

SimSearch 0.2. Google Labs - Books Ngram Viewer. Natural Language Toolkit.

Package Index : gensim 0.7.5. Python framework for fast Vector Space Modelling. Latest Version: 0.9.1. Gensim is a Python library for Vector Space Modelling with very large corpora.

Package Index : gensim 0.7.5

Target audience is the Natural Language Processing (NLP) community. Features: all algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM); intuitive interfaces that make it easy to plug in your own input corpus/datastream (trivial streaming API) and easy to extend with other Vector Space algorithms (trivial transformation API); efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections; distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers; extensive HTML documentation and tutorials.
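The "memory-independent" and "streaming" points can be illustrated without gensim itself. The sketch below uses only the standard library and assumes the corpus is any iterable that yields one sparse bag-of-words vector per document, as a list of (token id, count) pairs; `StreamingCorpus` and `source_factory` are hypothetical names, not part of gensim.

```python
import io
from collections import Counter


class StreamingCorpus:
    """Yield one sparse bag-of-words vector per document, never holding all of them.

    `source_factory` is a hypothetical callable returning a fresh iterable of
    lines (one document per line), so the corpus can be iterated more than once.
    """

    def __init__(self, source_factory):
        self.source_factory = source_factory
        self.token2id = {}  # vocabulary, grows as new tokens are seen

    def __iter__(self):
        for line in self.source_factory():
            counts = Counter(line.lower().split())
            # Sparse vector: sorted list of (token_id, count) pairs.
            yield sorted(
                (self.token2id.setdefault(tok, len(self.token2id)), n)
                for tok, n in counts.items()
            )


# An in-memory stand-in for a large file on disk.
text = "human computer interaction\ncomputer system response time\nhuman system system\n"
corpus = StreamingCorpus(lambda: io.StringIO(text))
vectors = list(corpus)
```

Only one line is in memory at a time during iteration; in practice `source_factory` would open a file far larger than RAM, and only the vocabulary dictionary would grow.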

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia. The simple way to install gensim is:

TinySegmenter in Python. What is this? “TinySegmenter in Python” is a Python version of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above. Its interface is compatible with NLTK’s TokenizerI, although the distribution file below does not directly depend on NLTK. If you’d like to use it as a tokenizer in NLTK, you have to modify the first few lines of the code as below:

    import nltk
    import re
    from nltk.tokenize.api import *

    class TinySegmenter(TokenizerI):
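To show the shape of the interface involved, here is a naive stand-in that exposes a `tokenize(text)` method returning a list of strings. It is not TinySegmenter and does not reproduce its segmentation; it simply splits wherever the script changes. The class name, the `script_of` helper, and the Unicode ranges are assumptions chosen for illustration, and the code is written for Python 3.

```python
from itertools import groupby


def script_of(ch):
    """Classify a character by Unicode block (a coarse, assumed set of ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isspace():
        return "space"
    return "other"


class ScriptBoundaryTokenizer:
    """Hypothetical stand-in exposing a tokenize(text) -> list of str method."""

    def tokenize(self, text):
        # Group consecutive characters of the same script; drop whitespace runs.
        return [
            "".join(run)
            for script, run in groupby(text, key=script_of)
            if script != "space"
        ]


tokens = ScriptBoundaryTokenizer().tokenize("日本語のテキスト")
```

Splitting on script boundaries breaks down quickly on real text (for example, words that mix kanji and hiragana), which is the kind of case a trained segmenter exists to handle.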

An open source web scraping framework for Python. Gensim – Python Framework for Vector Space Modelling — gensim documentation. The Xapian Project.

Machinelearning

PyLucene JCC vs GCJ [Archive] I'm writing because, after spending days with the PyLucene versions (GCJ and JCC), I decided to share the story.

PyLucene JCC vs GCJ [Archive]

About 5 days ago I didn't even know about the existence of Lucene. I was having many difficulties indexing full-text data on MySQL: it took about 90 hours to index ~64 million records in a single varchar(96) field. And when I thought it was finishing, I got a reboot on the development computer. Surfing the web and talking with friends, I was trying to find something better (and faster) to get my data indexed in a smart way, making it easy to search. Since I don't have heavy inserts and updates, I decided to use Lucene. Well, I chose PyLucene because much of my work was already written in Python, and also because I prefer Python over Java.

PyLucene @ Open Source Applications Foundation. Wchartype. Wchartype is a Python module for getting the types of double-byte (full-width) characters.

wchartype

It has no external dependencies. wchartype is licensed under the MIT license.

Usage

    import wchartype
    if wchartype.is_asian(u'\u65e5'):
        print u"\u65e5 is an Asian character"

Function Specification

is_asian: True if the character is Asian (char code greater than 0x3000)
is_full_width: AKA Zenkaku -- True if Asian or an ideographic space.
is_kanji: True if Kanji character (or Chinese)
is_hanzi: Alias for is_kanji.
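The function list above is specific enough to sketch in a few lines. What follows is a reading of that description, not wchartype's actual source: the threshold for `is_asian` and the ideographic space come from the text, while the range used for `is_kanji` (CJK Unified Ideographs, U+4E00 to U+9FFF) is an assumption, since the description does not give one. It is written for Python 3.

```python
IDEOGRAPHIC_SPACE = "\u3000"


def is_asian(ch):
    """True if the character code is greater than 0x3000, per the description."""
    return ord(ch) > 0x3000


def is_full_width(ch):
    """True if the character is Asian or the ideographic space (U+3000)."""
    return is_asian(ch) or ch == IDEOGRAPHIC_SPACE


def is_kanji(ch):
    """Assumed range: CJK Unified Ideographs (U+4E00 to U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


is_hanzi = is_kanji  # alias, as in the function list

if is_asian("\u65e5"):
    print("\u65e5 is an Asian character")
```

Note that the ideographic space itself sits exactly at 0x3000, so it fails the "greater than" test in `is_asian` and is only caught by `is_full_width`.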