
Crawler/scraper


HTTrack Website Copier - Offline Browser. The Open Graph Protocol. MIT Computer Science and Artificial Intelligence Laboratory | CS.

Nutch/MapReduce/Hadoop

References. Google. Algorithms. Applications. DanWeld. Software Agent: MIT Media Lab.

Nutch. Latest Step-by-Step Installation Guide for Dummies: Nutch 0.9, by Peter P. Wang, Zillionics LLC (peterwang@zillionics.com). Try the search engine I developed for The Christian Life: Malachi Search. Please support my effort by using the best free/low-price web hosting: 1&1 Inc. To add your comments, please go to:

Install the software one by one. First, install Cygwin: run cygwinSetup.exe. Second, install Java: run jdk-6u3-windows-i586-p.exe. Third, install Apache Tomcat: run apache-tomcat-6.0.14.exe.

Run it by clicking the Configure Tomcat icon, then click the Start button to start the Apache Tomcat service. Then you will be able to see the following screen in the browser if you go to

Fourth, unzip nutch-0.9.tar.gz to any directory you like, e.g. c:\nutch.

Set up the crawler. In the Cygwin window, go to your Nutch directory and set your JAVA_HOME as follows.. +^ <name>http.agent.name</name>
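The http.agent.name fragment above is a Nutch configuration property. As a minimal sketch, assuming the usual layout in which site-specific overrides live in conf/nutch-site.xml under the Nutch directory, the entry might look like the following. The value MyNutchCrawler is a made-up placeholder, not a name taken from the guide.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides (assumed location) -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Hypothetical crawler name; it is sent as part of the HTTP User-Agent header -->
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

Nutch expects this property to be non-empty before it will fetch pages, so choose a name that identifies your crawler to the sites it visits.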