Web Scraping

http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot#Python Robots or bots are automatic processes which interact with Wikipedia (and other Wikimedia projects) as though they were human editors. This page attempts to explain how to carry out the development of a bot for use on Wikimedia projects, and much of this is transferable to other wikis based on MediaWiki. The explanation is geared mainly towards those who have some prior programming experience but are unsure of how to apply this knowledge to creating a Wikipedia bot.

Creating a bot for Wikipedia
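
As a rough illustration of what "interacting with Wikipedia as a program" can look like, here is a minimal sketch that only builds a read-only query URL for the MediaWiki web API and does not send it. The helper name `build_query_url` and the choice of `prop=info` are assumptions made for this example, not something taken from the linked page; a real bot would also need to follow the project's bot policy.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public API endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Return a URL asking the MediaWiki API for basic info about one page.

    Hypothetical helper for illustration: it only constructs the URL,
    it does not perform any network request.
    """
    params = {
        "action": "query",   # read-only query module
        "prop": "info",      # basic page metadata
        "titles": title,     # page title to look up
        "format": "json",    # ask for a JSON response
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_query_url("Web scraping")
    print(url)
    # Show the parameters that were encoded into the query string.
    print(parse_qs(urlsplit(url).query))
```

The URL could then be fetched with any HTTP client; keeping URL construction separate from fetching makes the sketch easy to check without network access.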

http://www.crummy.com/software/BeautifulSoup/ You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need.

Beautiful Soup
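
To make the "navigating and searching a parse tree" idea concrete, here is a minimal sketch that assumes the `bs4` package (Beautiful Soup 4) is installed and uses Python's built-in `html.parser` backend. The HTML snippet, the class names, and the helper `extract_items` are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page you want data from.
HTML = """
<html><body>
<h1>Price list</h1>
<ul id="items">
  <li class="item"><a href="/a">Apples</a> <span class="price">1.20</span></li>
  <li class="item"><a href="/b">Bananas</a> <span class="price">0.50</span></li>
</ul>
</body></html>
"""


def extract_items(html):
    """Return (name, link, price) tuples for every list item in the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Search the tree for all <li class="item"> elements.
    for li in soup.find_all("li", class_="item"):
        link = li.find("a")                      # first <a> inside the item
        price = li.find("span", class_="price")  # the price <span>
        rows.append(
            (link.get_text(strip=True), link["href"], float(price.get_text()))
        )
    return rows


if __name__ == "__main__":
    for name, href, price in extract_items(HTML):
        print(name, href, price)
```

Real pages are rarely this tidy, so a scraper built this way would usually also check for missing tags before reading from them.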

mechanize

http://wwwsearch.sourceforge.net/mechanize/ Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize. The examples below are written for a website that does not exist (example.com), so they cannot be run. There are also some working examples that you can run.
http://scrapy.org/ Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Scrapy was designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core.

Scrapy | An open source web scraping framework for Python

http://dcortesi.com/2008/05/28/google-ajax-search-api-example-python-code/ For whatever reason, there aren’t many examples on the net of Python code that can be used with the Google AJAX Search API. I’m not really sure why this is, and perhaps I’m missing something, but for future reference here’s some sample Python code.

Google AJAX Search API Python Code