Changing Bits: Language detection with Google's Compact Language Detector

Google's Chrome browser has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it. Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.
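
CLD exposes a simple detect-style call through its Python bindings. A minimal sketch, assuming the pip-installable pycld2 package (a later packaging of the CLD2 bindings; the original post used its own hand-built binding):

import pycld2 as cld2

# detect() takes UTF-8 text and returns a reliability flag, the number of
# text bytes examined, and a tuple of the top detected languages
is_reliable, bytes_found, details = cld2.detect("Ceci est un petit texte en français.")
print(is_reliable)   # True if the guess is trustworthy
print(details[0])    # e.g. ('FRENCH', 'fr', 95, ...)
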
Package Index: SimSearch 0.2
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project. NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
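
As a taste of those text processing libraries, here is a minimal sketch of tokenization and part-of-speech tagging with a recent NLTK release (the downloads are one-time setup steps; model names vary slightly across NLTK versions):

import nltk

# One-time model downloads: sentence/word tokenizer and POS tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "NLTK is a leading platform for building Python programs."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ...]
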

Natural Language Toolkit

Gensim: Python framework for fast Vector Space Modelling. Latest version: 0.8.9.

Gensim is a Python library for Vector Space Modelling with very large corpora. Target audience is the Natural Language Processing (NLP) community.

- All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM)
- Intuitive interfaces: easy to plug in your own input corpus/datastream (trivial streaming API), easy to extend with other Vector Space algorithms (trivial transformation API)
- Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections
- Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers
- Extensive HTML documentation and tutorials

Package Index: gensim 0.7.5
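
A minimal sketch of the streaming bag-of-words plus online LSA workflow (toy three-document corpus; tokenization is assumed to have happened upstream):

from gensim import corpora, models

# Each document is a list of tokens
texts = [["human", "computer", "interaction"],
         ["graph", "minors", "survey"],
         ["human", "graph", "survey"]]

dictionary = corpora.Dictionary(texts)                 # map tokens to integer ids
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

# Online Latent Semantic Analysis over the streamed corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsi[corpus[0]])  # first document projected into the 2-D LSI space
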
TinySegmenter in Python

What is this? “TinySegmenter in Python” is a Python version of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above. “TinySegmenter in Python”’s interface is compatible with NLTK’s TokenizerI, although the distribution file below does not directly depend on NLTK. If you’d like to use it as a tokenizer in NLTK, you have to modify the first few lines of the code.
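
Basic usage is a single tokenize() call, per the TokenizerI-compatible interface. A minimal sketch, assuming the distribution is importable as a tinysegmenter module exposing a TinySegmenter class:

from tinysegmenter import TinySegmenter

segmenter = TinySegmenter()
# "My name is Nakano" - the classic TinySegmenter demo sentence
print(segmenter.tokenize(u'私の名前は中野です'))
# => [u'私', u'の', u'名前', u'は', u'中野', u'です']
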
What is Scrapy? Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features

Simple: Scrapy was designed with simplicity in mind, providing the features you need without getting in your way.
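
A minimal spider sketch against a recent Scrapy release (the spider name, target site, and CSS selectors are all hypothetical):

import scrapy

class QuotesSpider(scrapy.Spider):
    # Spider name and start URL are illustrative only
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item of structured data per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

If saved as quotes_spider.py, it can be run without a project via: scrapy runspider quotes_spider.py -o quotes.json
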

Scrapy | An open source web scraping framework for Python

Gensim – Python Framework for Vector Space Modelling — gensim documentation
Welcome to the Xapian project website. Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua and Erlang (so far!).

The Xapian Project
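
Through the Python bindings, indexing and searching look roughly like this (a minimal sketch; the database path and query text are hypothetical):

import xapian

# Create or open an on-disk database (path is illustrative)
db = xapian.WritableDatabase("example.db", xapian.DB_CREATE_OR_OPEN)

# Index one document with a stemmed free-text index
doc = xapian.Document()
doc.set_data("Xapian is an Open Source Search Engine Library.")
indexer = xapian.TermGenerator()
indexer.set_stemmer(xapian.Stem("en"))
indexer.set_document(doc)
indexer.index_text("Xapian is an Open Source Search Engine Library.")
db.add_document(doc)

# Parse a query and fetch the top ten matches
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("en"))
enquire = xapian.Enquire(db)
enquire.set_query(qp.parse_query("search library"))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())
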

PyLucene JCC vs GCJ [Archive]

I'm writing because, after spending days with both PyLucene versions (GCJ and JCC), I decided to share the story. About 5 days ago I didn't even know about the existence of Lucene. I was having many difficulties indexing full-text data on MySQL - it took something like 90 hours to index ~64 million records in a single varchar(96) field.
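
For context, indexing through PyLucene looks roughly like this under a modern JCC-built release, where the Python imports mirror the Java Lucene class names (a minimal sketch; the index path and field name are hypothetical):

import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory

lucene.initVM()  # start the embedded JVM once per process

# Open an on-disk index and add one document with a full-text field
directory = FSDirectory.open(Paths.get("index"))
writer = IndexWriter(directory, IndexWriterConfig(StandardAnalyzer()))

doc = Document()
doc.add(TextField("content", "one varchar(96) payload to index", Field.Store.YES))
writer.addDocument(doc)
writer.commit()
writer.close()
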
PyLucene @ Open Source Applications Foundation
GITS :: Code :: wchartype

wchartype is a Python module for getting the types of double-byte (full-width) characters. It has no external dependencies. wchartype is licensed under the MIT license.

Usage

import wchartype

if wchartype.is_asian(u'\u65e5'):
    print u"\u65e5 is an Asian character"

Function Specification