
Konstanz Information Miner

ScrapeBox – Harvest, Check, Ping, Post

Web-Harvest Project Home Page
1. Welcome screen with quick links
2. Web-Harvest XML editing with auto-completion support (Ctrl + Space)

DEiXTo - Web Content Extraction Tool

Wiki / Jobs
You can select what type of results 80legs generates for you. Available options are:

Unique and total count - 80legs outputs the number of unique matches and the total number of matches for your content selection strings (i.e., keywords or regular expressions).
Boolean array - 80legs outputs the two numbers above plus a 1 or 0 for each string, depending on whether or not that string was found.
Count array - 80legs outputs the unique and total count plus the total count for each string.
Code results - If you choose to analyze content using code, the result type defaults to this option.

Here are some examples of each result type. In these examples, we've crawled and analyzed two pages:

The contents of the first page are 'test1 test1 test2 test3 test5'.

test test1 test2 test3 test4 test5 test6

For 'Unique and total count' the output will be:

For 'Boolean array' the output will be:
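The three count-based result types described above can be sketched in Python. This is only an illustration under stated assumptions, not 80legs' implementation and not its actual output format: it assumes whole-word, case-sensitive matching, takes "unique" to mean the number of selection strings that occur at least once, and treats the list "test test1 ... test6" as the content selection strings. The function name analyze is hypothetical.

```python
import re
from collections import Counter


def analyze(page_text, strings):
    """Compute the three count-based result types for one page.

    Assumptions (not taken from 80legs): matching is whole-word and
    case-sensitive; 'unique' counts how many strings occur at least once.
    """
    # Count every whitespace/punctuation-delimited word on the page.
    word_counts = Counter(re.findall(r"\w+", page_text))

    # Count array: occurrences of each selection string, in input order.
    count_array = [word_counts[s] for s in strings]

    # Boolean array: 1 if the string was found at least once, else 0.
    boolean_array = [1 if c > 0 else 0 for c in count_array]

    return {
        "unique": sum(boolean_array),   # distinct strings that matched
        "total": sum(count_array),      # all matches, repeats included
        "boolean_array": boolean_array,
        "count_array": count_array,
    }


page = "test1 test1 test2 test3 test5"
strings = ["test", "test1", "test2", "test3", "test4", "test5", "test6"]
result = analyze(page, strings)
print(result)
```

Under substring matching instead of whole-word matching, the string "test" would match every word on the page, so the choice of matching rule changes all three results.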

Features

Ready for Mission Critical Applications

Simple to Use
You can be up and running with Spinn3r in less than an hour. We ship a standard reference client that integrates directly with your pipeline. If you're running Java, you can get up and running in minutes.

Real Time Indexing
Spinn3r is tied into the blog ping network provided by Google, Blogger, Ping-o-Matic, WordPress, FeedBurner, and many other content management systems. When a new blog post is published, we receive direct notification and add this weblog to the top of our queue.

Spam Prevention
We've developed complex spam prevention technology to prevent spam from being added to our index.

Ultra Reliable Infrastructure
Spinn3r is hosted in a world class data center. Spinn3r is monitored 24/7 for any potential error in the system.

Massive Cost Savings
The bandwidth costs alone for running a crawler can break the bank.

Language Classification
Every post indexed by Spinn3r is classified by language.

Making Your Job Easier

Microformats

Scraping · chriso/ Wiki

includes a robust framework for scraping data from the web. The primary methods for scraping data are get and getHtml, although there are methods for making any type of request, modifying headers, etc. See the API for a full list of methods.

A note before you start scraping

The --debug switch is your friend - use it to see the request and response headers, and whether there was an error with the request. If your scraping job is behaving unexpectedly, --debug will show you what's going on under the hood.

    --debug my_scraping_job

Example 1: Save a web page to disk

save.js

    nodeio = require ''
    class SavePage extends nodeio.JobClass
        input: false
        run: () ->
            url = @options.args[0]
            @get url, (err, data) =>
                if err?

To save a page to disk, run

    $ -s save " > google.html

Which is equivalent to

    $ curl " > google.html

Example 2: Get the number of Google results for a list of keywords

keywords.js