
Web scraping: Reliably and efficiently pull data from pages that don't expect it



Link Extractors - Scrapy 0.20.2 documentation
LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) that will eventually be followed. Two Link Extractors are available in Scrapy by default, but you can create your own custom Link Extractors to suit your needs by implementing a simple interface.

urllib2 - extensible library for opening URLs
Note: The urllib2 module has been split across several modules in Python 3, named urllib.request and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.
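
The link-extraction idea from the Link Extractors excerpt above can be sketched with only the Python 3 standard library (urllib.parse and html.parser rather than the Python 2 urllib2 family). This is not Scrapy's LinkExtractor API: the class name SimpleLinkExtractor, the helper extract_links, and the sample URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleLinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = SimpleLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would typically feed each downloaded page through such a function and queue the returned URLs for later requests.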

Web Scraping with Python
by Javier Collado | November 2008 | Open Source
Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we're at it.

urllib2 - The Missing Manual
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:
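
As a rough illustration of the status-code ranges described in the urllib2 excerpt above, the sketch below uses http.client.responses, the Python 3 dictionary that maps status codes to reason phrases. The function name describe_status and the category labels are assumptions made for this example; the ranges follow the excerpt's wording.

```python
from http.client import responses  # maps status codes to reason phrases


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    if 100 <= code < 300:
        category = "success"        # the excerpt treats 100-299 as success
    elif 300 <= code < 400:
        category = "redirect"       # normally handled by the default handlers
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "unknown"
    return category, responses.get(code, "Unknown")
```

For instance, describe_status(404) gives the category "client error" together with the phrase "Not Found".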

selenium 2.21.2 - Python bindings for Selenium
Python language bindings for Selenium WebDriver. The selenium package is used to automate web browser interaction from Python. Several browsers/drivers are supported (Firefox, Chrome, Internet Explorer, PhantomJS), as well as the Remote protocol.

Item Pipeline - Scrapy 0.21.0 documentation
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component (sometimes referred to as just "Item Pipeline") is a Python class that implements a simple method. Components receive an Item and perform an action on it, also deciding whether the Item should continue through the pipeline or be dropped and no longer processed. Typical uses of item pipelines are: cleansing HTML data; validating scraped data (checking that the items contain certain fields); checking for duplicates (and dropping them); storing the scraped item in a database.
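
The pipeline behaviour described in the Item Pipeline excerpt above (sequential components that either pass an item on or drop it) can be sketched in plain Python. Scrapy components implement process_item(self, item, spider) and raise scrapy.exceptions.DropItem to discard an item; here DropItem is a local stand-in so the sketch runs without Scrapy, and RequiredFieldsPipeline, DuplicatesPipeline, run_pipeline, and the field names are hypothetical.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class RequiredFieldsPipeline:
    """Validate that each item carries the fields we expect."""

    required = ("title", "url")

    def process_item(self, item, spider):
        missing = [field for field in self.required if not item.get(field)]
        if missing:
            raise DropItem("missing fields: %s" % ", ".join(missing))
        return item


class DuplicatesPipeline:
    """Drop items whose URL has already been seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen_urls:
            raise DropItem("duplicate: %s" % item["url"])
        self.seen_urls.add(item["url"])
        return item


def run_pipeline(items, components, spider=None):
    """Pass each item through the components in order; skip dropped items."""
    kept = []
    for item in items:
        try:
            for component in components:
                item = component.process_item(item, spider)
        except DropItem:
            continue
        kept.append(item)
    return kept
```

In a real Scrapy project the framework itself plays the role of run_pipeline, and a final component would typically write the surviving items to a database.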

Decoding CAPTCHA's
Most people don't know this, but my honours thesis was about using a computer program to read text out of web images. My theory was that if you could get a high level of successful extraction, you could use it as another source of data to improve search engine results. I was even quite successful in doing it, but never really followed my experiments up. My honours advisor, Dr Junbin Gao, had suggested that after writing my thesis I should write some form of article on what I had learnt. Well, I finally got around to doing it. While what follows is not exactly what I was studying, it is something I wish had existed when I started looking around.

How To Write Your First Ruby Web Bot In Watir
Time for the fun stuff now. The holy grail for a lot of Internet Marketers is automation. This can be achieved through simple iMacros scripts, some PHP scripts on a server, or with a little tool called Watir using the Ruby programming language.

Item Loaders - Scrapy 0.22.0 documentation
Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, the Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it. In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container. Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc.), without becoming a nightmare to maintain.

PyQuery
pyquery allows you to make jquery queries on xml documents. The API is as similar to jquery as possible. pyquery uses lxml for fast xml and html manipulation. This is not (or at least not yet) a library to produce or interact with javascript code. I just liked the jquery API and I missed it in python so I told myself "Hey let's make jquery in python". This is the result.

Emulating a Browser in Python with mechanize
It is always useful to know how to quickly instantiate a browser in the command line or inside your python scripts. Every time I need to automate any task regarding web systems, I use this recipe to emulate a browser in python:

    import mechanize
    import cookielib

    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follow refresh 0 but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # Want debugging messages?
    #br.set_debug_http(True)
    #br.set_debug_redirects(True)
    #br.set_debug_responses(True)

    # User-Agent (this is cheating, ok?)

Requests and Responses
Using FormRequest to send data via HTTP POST: If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

Beautiful Soup: We called him Tortoise because he taught us.
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects. If you have questions, send them to the discussion group.
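
The form POST mentioned in the Requests and Responses excerpt above can be approximated without Scrapy. The sketch below does not use FormRequest; it builds an equivalent request with the Python 3 standard library and does not send it. The function name build_form_post, the URL, and the field names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, fields):
    """Build (but do not send) a POST request carrying form-encoded fields."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical form fields and endpoint, for illustration only.
request = build_form_post(
    "http://www.example.com/post/action",
    {"name": "John Doe", "age": "27"},
)
```

Passing the resulting object to urllib.request.urlopen would actually submit the form; that step is left out so the sketch makes no network calls.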