Nutch: Getting my Feet Wet. My motivation for learning Nutch is twofold. First, we are using Nutch for a number of our more recent crawls, so I figured it's something I should know about. Second, Nutch uses Hadoop Map-Reduce, so I figured I would pick up some Map-Reduce programming tips by looking at the Nutch sources. This post describes my attempt to crawl this blog using Nutch and index it. It also describes a very simple plugin to filter URLs by pattern at index time. Command line Usage. Nutch. Features. The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History. Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. In June 2003, a successful 100-million-page demonstration system was developed. 80legs. Pastebin - collaborative debugging tool.