Text extraction

Overview: Extracting article text from HTML documents | My tech blog.

In the world of web scraping, text mining and article reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for tools capable of distinguishing the parts of an HTML document that represent an article from other common website building blocks such as menus, headers, footers and ads.
http://tomazkovacic.com/blog/14/extracting-article-text-from-html-documents/
http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/

Evaluating Text Extraction Algorithms | My tech blog.

UPDATE 11/6/2011: Added the summary and the results table. Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary HTML documents.
http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/

List of resources: Article text extraction from HTML documents | My tech blog.

UPDATE 21/3/2011: Added reader contributed links to software and API section