
Scraping


Readability API.

Web scraping: reliably and efficiently pull data from pages that don't expect it.

How to crawl a quarter billion webpages in 40 hours. More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. Of course, there's nothing especially new here: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing. Still, I learned some lessons that may be of interest to others, and so in this post I describe what I did. What does it mean to crawl a non-trivial fraction of the web? Code: Originally I intended to make the crawler code available under an open source license at GitHub.
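The headline figures above imply some useful rates. The page count, dollar cost, machine count, and elapsed time are quoted from the post; the derived per-second and per-million-page numbers below are straightforward arithmetic on them:

```python
# Back-of-the-envelope figures implied by the numbers quoted above.
# pages, dollars, machines, and elapsed time come from the post;
# the derived rates are simple arithmetic on those inputs.
pages = 250_113_669
dollars = 580                    # "just under 580 dollars", taken as an upper bound
machines = 20
seconds = 39 * 3600 + 25 * 60    # 39 hours 25 minutes

pages_per_second = pages / seconds                 # fleet-wide crawl rate
pages_per_machine = pages_per_second / machines    # rate per EC2 instance
cost_per_million = dollars / (pages / 1_000_000)   # dollars per million pages

print(f"{pages_per_second:.0f} pages/s across the fleet")
print(f"{pages_per_machine:.0f} pages/s per EC2 instance")
print(f"${cost_per_million:.2f} per million pages")
```

That works out to roughly 1,760 pages per second across the fleet, about 88 pages per second per machine, and about $2.32 per million pages crawled.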

There's a more general issue here, which is this: who gets to crawl the web? I'd be interested to hear other people's thoughts on this issue. Architecture: Here's the basic architecture:

CommonCrawl.

Boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages. The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings. Extraction is very fast (milliseconds), requires only the input document (no global or site-level information), and is usually quite accurate.
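The core idea behind this style of boilerplate removal can be illustrated with a short sketch. This is not the boilerpipe library itself (which is Java and uses classifiers trained on labeled data); it is a minimal stand-in showing two of the shallow text features the approach relies on, words per block and link density, with threshold values that are my own guesses rather than anything from the paper:

```python
# Illustrative sketch (NOT the boilerpipe library): split an HTML page into
# text blocks, then keep blocks that look like main content -- reasonably
# long and not dominated by link text. Thresholds are assumptions.
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect text blocks from HTML, counting words inside <a> tags."""
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "article"}

    def __init__(self):
        super().__init__()
        self.blocks = []                 # list of (text, anchor_word_count)
        self._text, self._anchor_words = [], 0
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_anchor:
            self._anchor_words += len(data.split())

    def _flush(self):
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._anchor_words))
        self._text, self._anchor_words = [], 0


def extract_content(html, min_words=10, max_link_density=0.33):
    parser = BlockExtractor()
    parser.feed(html)
    parser._flush()
    kept = []
    for text, anchor_words in parser.blocks:
        words = len(text.split())
        link_density = anchor_words / words if words else 1.0
        # Long, link-poor blocks are likely content; short, link-heavy
        # blocks (navigation menus, footers) are likely boilerplate.
        if words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

For example, given a page with a navigation `<div>` full of links and a `<p>` holding the article text, `extract_content` keeps the paragraph and drops the menu. The real library adds trained decision rules over many such features, which is where its accuracy comes from.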

Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. The algorithms it uses are based on (and extend) concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA. Commercial support is available through Kohlschütter Search Intelligence. Latest release: boilerpipe 1.2.0 (2011-06-06).