background preloader

Html-parser-ruby

Facebook Twitter

Ruby Screen-Scraper in 60 Seconds. By Ilya Grigorik on February 04, 2007 I often find myself trying to automate content extraction from a saved HTML file or a remote server.

Ruby Screen-Scraper in 60 Seconds

I've tried a number of approaches over the years, but the dynamic duo of Hpricot and Firebug blew me away - this is by far the fastest way to get what you want without compromising flexibility. Hpricot is an extremely powerful ruby-based HTML parser, and Firebug is arguably the best on-the-fly development add-on for Firefox. Now, I said it will take you about 60 seconds. I lied, it should take less. Introducing open-uri Ruby comes with a very flexible, production ready library that wraps all http/https connections into a single method call: open. FireBug kung-foo Now that we have the document, we need to pull out some content that interests us - usually, this is the tedious part based on regular expressions, stream parsers, etc. So which came first, the parser, which will extract this, or this quote? Hpricot magic Running our screen-scraper produces:

Rubyful Soup: "The brush has got entangled in it!" Scraping Gmail with Mechanize and Hpricot. This document explains the use of Rubyful Soup: how to create a parse tree, how to navigate it, how to search it, and how to print it out.

Quick Start Here's a Ruby session that demonstrates the basic features of Rubyful Soup. Creating and Feeding the Parser The done method Navigating the Parse Tree When you feed a markup document into one of Rubyful Soup's parser classes, Rubyful Soup transforms the markup into a parse tree: a set of linked objects representing the structure of the document. The parser object is the root of the parse tree. For concreteness, here's a visual representation of the parse tree for the example HTML I introduced in the "Quick Start" section. <html><head><title>Page title </title></head><body><p id="firstpara" align="center">This is paragraph <b>one </b>. This is saying: we've got an html Tag which contains a head Tag and a body Tag. All Tag objects have all of the members listed below (though the actual value of the member may be nil).

Parent contents string to_s. ScRUBYt! - a Simple to Learn and Use, yet Powerful Web Scraping. Mofo - a ruby microformat parser. Html5lib - Google Code. HTML parsing as good as Perls. - comp.lang.ruby. First let me be very clear.

HTML parsing as good as Perls. - comp.lang.ruby

I hate the language that Larry "should belined up against a " Wall has written. IMO it encourages peopleto program with, well only men can program that way, instead of theirheads. However as bad as the language is, LWP is one of the best librariesaround when it comes to web related applications. Most notablelyI have never found a library which can parse HTML as well as LWPs HTML parser.

It is my eternal hope that I can find a library asgood, and dump the language. With the advent of Ruby on Rails, I am hopeful that there might be apackage in Ruby that gives Perl's HTML parser a run for it's money. I'm nt looking for an XML parser, XML parsers just can't handlemany of the web sites I want to parse. Finally, there will be some smartass, who will say that I should useweb sites that are written in good HTML.

Thanks The reply-to email address is olczy... **Thaddeus L. There is a difference between*thinking* you know something,and *knowing* you know something. Hpricot, a fast and delightful HTML parser. Scrape the web with ruby « Muharem Hrnjadovic. Introduction In the last few months I have taken some time to play with a number of dynamic languages.

Scrape the web with ruby « Muharem Hrnjadovic

My experiments were mostly in the “web hacks” category e.g. fetching files from the web and extracting data of interest from these. For my most recent hack (get wordpress weblog statistics) I used Ruby. The task at hand The task at hand consists of fetching the weblog statistics for my wordpress weblog and displaying them in the terminal window. The tools used After briefly surveying the tools and libraries available in Ruby-land I settled for WWW::Mechanize a Ruby implementation of Perl‘s venerable WWW-Mechanize CPAN module.