Text extraction
< Development
< Tech
< davidb
Get flash to fully experience Pearltrees
In the world of web scraping, text mining and article reading utilities (readability bookmarklet) there is an ever growing demand for utilities that are capable of distinguishing parts of a HTML document which represent an article apart from other common website building blocks like menus, headers, footers, ads etc.
UPDATE 11/6/2011: Added the summary and the results table Lately I’ve been working on evaluating and comparing algorithms, capable of extractinguseful content from arbitrary html documents.
UPDATE 21/3/2011: Added reader contributed links to software and API section