crawler/scraper


HTTrack Website Copier - Offline Browser

HTTrack allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack preserves the original site's relative link structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. http://www.httrack.com/
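As a rough illustration of what preserving a relative link structure can mean for a mirror, here is a minimal Python sketch. It is not HTTrack's actual algorithm: the `host/path` directory layout, the `index.html` default, and the function names `local_path` and `relative_link` are all assumptions made for this example.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url: str) -> str:
    """Map a URL to a file path inside a mirror directory (assumed layout: host/path)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: directory-style URLs are stored as index.html.
        path += "index.html"
    return posixpath.join(parts.netloc, path.lstrip("/"))


def relative_link(from_url: str, to_url: str) -> str:
    """Relative link from the mirrored copy of from_url to the mirrored copy of to_url."""
    start_dir = posixpath.dirname(local_path(from_url))
    return posixpath.relpath(local_path(to_url), start_dir)


if __name__ == "__main__":
    print(local_path("http://example.com/docs/a.html"))
    print(relative_link("http://example.com/docs/a.html",
                        "http://example.com/img/logo.png"))
```

Because every link is rewritten relative to the page that contains it, the mirrored directory can be moved or opened straight from disk without a web server.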
The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook. While many different technologies and schemas exist and could be combined together, there isn't a single technology which provides enough information to richly represent any web page within the social graph. The Open Graph protocol builds on these existing technologies and gives developers one thing to implement. Developer simplicity is a key goal of the Open Graph protocol, which has informed many of the technical design decisions. To turn your web pages into graph objects, you need to add basic metadata to your page.

The Open Graph Protocol

http://ogp.me/
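To make the "basic metadata" concrete, here is a minimal Python sketch that reads Open Graph `<meta property="og:..." content="...">` tags from a page using only the standard library. The sample page, its example.com URLs, and the names `OpenGraphParser` and `parse_open_graph` are made up for illustration; ogp.me lists og:title, og:type, og:image and og:url as the basic properties.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:*" content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property (a simplification).
            self.properties.setdefault(prop, content)


def parse_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical page used only for this example.
SAMPLE_PAGE = """<html><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta property="og:url" content="https://example.com/the-rock" />
<meta property="og:image" content="https://example.com/rock.jpg" />
</head><body></body></html>"""

if __name__ == "__main__":
    for name, value in parse_open_graph(SAMPLE_PAGE).items():
        print(name, "=", value)
```

A crawler or scraper could run such a parser over each fetched page to pull out a title, type, canonical URL and preview image without page-specific rules.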

MIT Computer Science and Artificial Intelligence Laboratory | CS

MIT CSAIL Project Could Transform Robotic Design and Production http://www.csail.mit.edu/
nutch/mapReduce/hadoop

references

google

algorithms

applications

DanWeld

Software Agent: MIT Media Lab

The Software Agents Group of the MIT Media Laboratory investigates computer systems to which one can delegate tasks. Software agents differ from conventional software in that they are long-lived, semi-autonomous, proactive, and adaptive. http://agents.media.mit.edu/
http://trackgc.com/tr/resources/articles/NutchGuideForDummies.htm
Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can range from tens of thousands to millions, depending on your resources.
Web searching based on the crawl result above:

Nutch
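The interplay of a domain filter, crawl depth and a per-level page cap can be sketched in a few lines of Python. This is not Nutch code: the in-memory `LINKS` graph, the regular expression, and the names `make_domain_filter` and `crawl` are assumptions for illustration. Nutch chooses the top-scoring pages at each level, whereas this sketch simply takes the first `top_n` and drops the rest.

```python
import re

# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "http://apache.org/": [
        "http://nutch.apache.org/",
        "http://hadoop.apache.org/",
        "http://example.com/",
    ],
    "http://nutch.apache.org/": ["http://nutch.apache.org/docs", "http://apache.org/"],
    "http://hadoop.apache.org/": ["http://hadoop.apache.org/docs"],
}


def make_domain_filter(domain):
    """Return a predicate that accepts URLs on `domain` or any of its subdomains."""
    pattern = re.compile(r"^https?://([a-z0-9-]+\.)*" + re.escape(domain) + r"(/|$)")
    return lambda url: bool(pattern.match(url))


def crawl(seeds, get_links, accept, depth, top_n):
    """Level-by-level crawl; returns the list of URLs fetched at each level."""
    seen = set()
    frontier = [url for url in seeds if accept(url)]
    levels = []
    for _ in range(depth):
        batch = []
        for url in frontier:
            if url not in seen and len(batch) < top_n:
                seen.add(url)
                batch.append(url)
        if not batch:
            break
        levels.append(batch)
        # Next level: outlinks of this batch that pass the filter and are new.
        frontier = [
            link
            for url in batch
            for link in get_links(url)
            if accept(link) and link not in seen
        ]
    return levels


if __name__ == "__main__":
    accept = make_domain_filter("apache.org")
    for level, urls in enumerate(
        crawl(["http://apache.org/"], lambda u: LINKS.get(u, []), accept, depth=3, top_n=10)
    ):
        print(level, urls)
```

Running it first with a small `depth` and `top_n`, then inspecting which URLs appear at each level, mirrors the testing approach the guide describes before committing to a full crawl.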