Web Scraping
Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize. The examples below are written for a website that does not exist (example.com), so cannot be run. There are also some working examples that you can run.

    import re
    import mechanize

    br = mechanize.Browser()
    br.open("http://www.example.com/")
    # follow second link with element text matching regular expression
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    assert br.viewing_html()
    print(br.title())
    print(response1.geturl())
    print(response1.info())  # headers
    print(response1.read())  # body

    br.select_form(name="order")
    # Browser passes through unknown attributes (including methods)
    # to the selected HTMLForm.
    br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
    # Submit current form.
In preparation for my PyCon talk on HTML I thought I’d do a performance comparison of several parsers and document models. The situation is a little complex because there are different steps in handling HTML:

1. Parse the HTML
2. Parse it into something (a document object)
3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 2, 3, etc. For instance, ElementSoup uses ElementTree as a document, but BeautifulSoup as the parser.
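The three steps can be sketched with the Python 3 standard library alone. This is only an illustration, not one of the libraries the comparison covers: the class name TreeBuildingParser, the helper parse_html, and the VOID_ELEMENTS set are hypothetical, and the sketch assumes well-nested markup (a real HTML parser also has to recover from broken nesting).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete set of elements that never take a closing tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuildingParser(HTMLParser):
    """Step 1 (parse): turn markup into start/end/data events."""

    def __init__(self):
        super().__init__()
        # Step 2 (document): the events are collected into an ElementTree.
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_ELEMENTS:
            # Close void elements immediately; they have no end tag.
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_html(markup):
    """Parse well-nested markup and return the root Element."""
    parser = TreeBuildingParser()
    parser.feed(markup)
    parser.close()
    return parser.builder.close()


root = parse_html("<html><body><p class='x'>Hello <b>world</b></p></body></html>")

# Step 3 (serialize): write the document object back out as text.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Here the parser and the document model are separate pieces, which is the same split as a library that pairs one project's parser with another project's tree.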
Most people don’t know this but my honours thesis was about using a computer program to read text out of web images.
Note
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary that lists all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:

    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),
        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),
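As a rough sketch of how that table can be used when a request fails, the snippet below looks up the reason phrase for a status code. It assumes Python 3, where the same table is available as http.server.BaseHTTPRequestHandler.responses and urllib2 has become urllib.request and urllib.error. The helper names describe_status and fetch are hypothetical, and fetch is defined but not called because it needs network access.

```python
from http.server import BaseHTTPRequestHandler
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_status(code):
    """Return a string such as '404 Not Found' for a numeric status code."""
    # Each entry is a (short phrase, longer explanation) pair.
    phrase, _explanation = BaseHTTPRequestHandler.responses.get(
        code, ("Unknown", "")
    )
    return "%d %s" % (code, phrase)


def fetch(url):
    """Return (status description, body) for url (hypothetical helper).

    Redirects are followed by the default handlers, so HTTPError is
    normally raised only for codes in the 400-599 range.
    """
    try:
        with urlopen(url, timeout=10) as response:
            return describe_status(response.status), response.read()
    except HTTPError as err:
        # The error object carries the status code and the error page body.
        return describe_status(err.code), err.read()


print(describe_status(404))
print(describe_status(503))
```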
Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers. Note that the separate ports are not kept in sync; they are effectively different projects offering similar functionality for their respective languages. Users of the sanitizer must ensure that they serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8. The Ruby port is currently unmaintained.

Features: parses valid and invalid HTML documents to a tree; support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats; DOM to SAX converter; reports parse errors; character encoding detection; filtering and serializing of trees; HTML+CSS sanitizer; many unit tests.
pyquery allows you to make jQuery queries on XML documents.
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. If you have questions, send them to the discussion group.
by Leonard Richardson (leonardr@segfault.org)