
Nokogiri


Ruby/XML, XSLT and XPath Tutorial

What is XML?

The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium and available as an open standard. XML is a portable, open format that lets programmers build applications whose data can be read by other applications, regardless of operating system or development language. XML is extremely useful for keeping track of small to medium amounts of data without requiring a SQL-based backbone.

XML Parser Architectures and APIs: There are two different flavors available for XML parsers. SAX-like (stream interfaces): here you register callbacks for events of interest and then let the parser proceed through the document.
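The callback style described above can be sketched with the stream parser in REXML, which is distributed with Ruby. This is a minimal illustration, not part of the original tutorial: the ElementCounter class and the sample document are made up for the example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts how often each element name is opened.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called by the parser every time an opening tag is encountered.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Made-up sample document.
xml = '<library><book/><book/><magazine/></library>'

counter = ElementCounter.new
REXML::Document.parse_stream(xml, counter)
element_counts = counter.counts

puts element_counts.inspect
```

The parser never builds a tree here; the listener only sees one event at a time, which is why this style keeps memory use low.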

Because a SAX parser streams through the document, it needs little memory even for large files, whereas a DOM parser loads the whole document into a tree in memory, which is convenient for navigation but costly for large files. SAX is read-only, while DOM allows changes to the XML file.

Parsing and Creating XML using Ruby: The most common way to manipulate XML is with the REXML library by Sean Russell.

Nicholas' Adventures: Nokogiri - Cut With Precision

Many times we as developers have to deal with complex data, be it an ActiveResource result set or an HTML/XML document.

Trying to parse data out of these using for each and nesting loops within loops can be cumbersome. A more elegant solution is to use nokogiri and XPath. Nokogiri is a type of Japanese saw; it is also a Ruby gem that you can use to easily deal with XML or HTML documents. (Hint: ActiveRecord and ActiveResource objects both have to_xml methods.) You can easily install nokogiri (make sure you have the libxml2 development packages installed, as the gem requires these to be properly built).

    $ sudo gem install nokogiri

Now consider the following XML document: foods.xml

Before we can work with our data we need to read the XML into Nokogiri.

    > require 'rubygems'
    > require 'nokogiri'
    > doc = Nokogiri::XML.parse(File.read('foods.xml'))
    => #<Nokogiri::XML::Document:0x3f930c9db884 ...

Nokogiri

From a String: We’ve tried to make this easy on you. Really! We’re here to make your life easier. The variables html_doc and xml_doc are Nokogiri documents, which have all kinds of interesting properties and methods that you can read about in the Nokogiri API documentation. We’ll cover the interesting bits in other chapters.

From a File: Note that you don’t need to read the file into a string variable. Clever Nokogiri!

From the Internets: I understand that there may be some HTML documents available on the World Wide Web.

Parse Options: Nokogiri offers quite a few options that affect how a document is parsed.

NOBLANKS - Remove blank nodes
NOENT - Substitute entities
NOERROR - Suppress error reports
STRICT - Strict parsing; raise an error when parsing malformed documents
NONET - Prevent any network connections during parsing

Options can be set in two ways: by chaining them on the config object yielded to a block, or by passing a bitmask of Nokogiri::XML::ParseOptions constants.

Encoding: Strings are always stored as UTF-8 internally. Some documents declare one particular encoding, but use a different one.

Getting Started with Nokogiri and XML in Ruby

Here's a short post on getting started with Nokogiri - a Ruby gem that wraps libxml2.

I'm writing this because, well, the docs at Nokogiri kind of suck. I wanted to read a simple XML document. My XPath fu was a little rusty, although all I wanted to do was read some attributes from a root element, some element values off of the root, and then a short collection of items (very similar to an Atom document). My main bone of contention with the Nokogiri docs was their use of the @doc.xpath("//character") search operator at the very beginning of their parsing tutorial. How about we start from the beginning: here is a sample XML document.

    <Collection version="2.0" id="74j5hc4je3b9">
      <Name>A Funfair in Bangkok</Name>
      <PermaLink>Funfair in Bangkok</PermaLink>
      <PermaLinkIsName>True</PermaLinkIsName>
      <Description>A small funfair near On Nut in Bangkok.

From our IRB prompt, the first thing we'll do is require nokogiri.

    >> require 'nokogiri'
    => true