
Web Scraping with R


Web-Scraper for Google Scholar updated!

Scraping table from any web page with R or CloudStat: if you need data from the internet, you don't have to retype it; you can extract or scrape it directly as long as you know the web URL.

Thanks to the XML package for R, which provides the amazing readHTMLTable() function. As a study case, I want to scrape two datasets: (A) an airline table and (B) a chess table. The code and URLs were partly lost from this excerpt; reconstructed, the session looked roughly like this:

    > library(XML)
    Warning message:
    package 'XML' was built under R version 2.14.1
    > airline = readHTMLTable("...")   # airline table URL elided in the excerpt
    > chess = readHTMLTable("...")     # chess table URL elided in the excerpt

Done. Then you can analyze the tables as usual. A runnable sketch with an illustrative URL follows below.

Retrieving RSS Feeds Using Google Reader.

Scraping R-bloggers with Python – Part 2: In my previous post I showed how to write a small, simple Python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html, post2.html and so forth.
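Since the URLs in the table-scraping excerpt above did not survive, here is a minimal sketch of the readHTMLTable() approach it describes; the Wikipedia address is purely an illustrative assumption, not a URL from the original post.

    library(XML)

    # Any page containing an HTML <table> will do; this URL is only an example.
    u <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"
    tables <- readHTMLTable(u, stringsAsFactors = FALSE)

    length(tables)     # one data frame per <table> element found on the page
    head(tables[[1]])  # inspect the first table and pick the one you need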

The next step is to write a small script that extracts the information we want from each page and stores it in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store them in a .csv file with a unique id. To do this, open a document in your favourite Python editor (I like to use Aquamacs) and name it extraction.py. As in the previous post, we start by importing the modules we will use for the extraction, and we will again rely on the BeautifulSoup module to pull the relevant information out of the pages:

    from BeautifulSoup import BeautifulSoup
    import os
    import re

    os.chdir(path)   # path: the folder holding the downloaded post1.html, post2.html, ... files
    data = {}        # dictionary that will hold one record per post
    keys = data.keys()

Unshorten any URL with R: I was asked by a friend how to find the full, final address of a URL which had been shortened via a shortening service (e.g., Twitter's t.co, Google's goo.gl, Facebook's fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.).

I replied that I had no idea, that he should maybe have a look on StackOverflow.com or possibly the R-help list, and that if that didn't turn up anything he could try an online unshortening service. Two minutes later he came back with a solution from Stack Overflow which, surprisingly to me, contained an answer I had provided about 1.5 years ago! This has always been my problem with programming: I learn something useful and then completely forget it. I'm kind of hoping that having this blog will help me remember these sorts of things. The objective: decode a shortened URL to reveal its full, final web address. The solution and its limitations are given in the original post; a hedged sketch of one way to do it follows below.

Getting web data through R.

Next Level Web Scraping: The outcome presented above will not be very useful to most of you; still, it is a good example of what can be done via web scraping in R.
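Returning to the URL-unshortening objective: the sketch below is not the Stack Overflow answer referred to above, just one minimal way to do it using the httr package, which follows redirects for us.

    library(httr)

    # Follow the redirect chain and report where it ends up.
    unshorten_url <- function(u) {
      resp <- HEAD(u)   # httr follows redirects by default
      resp$url          # the final, fully expanded address
    }

    # Hypothetical usage with a made-up short link:
    # unshorten_url("http://bit.ly/example")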

Background: TIRIS is the federal geo-statistical service of North Tyrol, Austria. Among the many great things it provides are historical and recent aerial photographs, and these photographs can be addressed via URLs. That is the basis of the script: the URLs are retrieved, some parameters are adjusted, and the customised addresses are then used to download the images and animate them with saveHTML from the animation package. The outcome (an HTML animation) lets you view and skip through aerial photographs of any location in North Tyrol from 1940 to 2010 and see how the landscape, buildings, etc. have changed. The script is linked from the original post; a rough sketch of the download-and-animate step appears below.

Web scraping with Python – the dark side of data: In searching for some information on web scrapers, I found a great presentation given at PyCon in 2010 by Asheesh Laroia. I thought this might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites.
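As a rough sketch of the download-and-animate idea in the TIRIS excerpt above (the real script builds its image addresses from TIRIS parameters; the URLs below are placeholders):

    library(animation)
    library(jpeg)

    # Placeholder addresses standing in for the customised TIRIS image URLs.
    urls <- sprintf("http://example.org/tiris/aerial_%s.jpg", c(1940, 1970, 2010))

    saveHTML({
      for (u in urls) {
        tmp <- tempfile(fileext = ".jpg")
        download.file(u, tmp, mode = "wb", quiet = TRUE)
        img <- readJPEG(tmp)
        plot.new()
        rasterImage(img, 0, 0, 1, 1)   # each downloaded photograph becomes one animation frame
      }
    }, img.name = "aerial", htmlfile = "aerial_animation.html")

saveHTML() simply records each plot drawn inside the braces as one frame of the resulting HTML animation.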

A link to the presentation is given in the original post. Highlights (at least from my perspective): screen scraping is not about regular expressions, and the mechanize package features heavily in the presentation's examples. There was also some mention of how JavaScript causes problems for web scrapers, although this can be overcome with web drivers such as Selenium and Watir. Please feel free to post your comments about your experiences with screen scraping and about other tools that you use to collect web data for R.

R-Function GScholarScraper.

Google Scholar (Partial Success).

Google Scholar: Part 2.

    library(RCurl)
    library(XML)

    get_google_scholar_df <- function(u) {
      html <- getURL(u)
      doc  <- htmlParse(html)

      # Apply FUN to the given XPath within each individual search result node,
      # returning NA for results where the node is missing.
      GS_xpathSApply <- function(doc, path, FUN) {
        path.base <- "/html/body/div[@class='gs_r']"
        nodes.len <- length(xpathSApply(doc, path.base))
        paths <- sapply(1:nodes.len, function(i) gsub(path.base,
                        paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
        xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
        xx[sapply(xx, length) < 1] <- NA
        xx <- as.vector(unlist(xx))
        return(xx)
      }

      df <- data.frame(
        footer      = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
        title       = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
        type        = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
        publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
        stringsAsFactors = FALSE)

      df <- df[, -1]   # drop the footer column
      return(df)
    }
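A hedged usage example for the function above; the query URL is invented, and Google Scholar's markup has changed since this was written, so the XPath expressions may no longer match today's pages.

    # Hypothetical Google Scholar results URL, for illustration only.
    u  <- "http://scholar.google.com/scholar?q=web+scraping"
    df <- get_google_scholar_df(u)
    head(df[, c("title", "publication")])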

GScholarScraper function with XPath. Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which, when given a search string, will scrape the associated search results returned by Google Scholar across pages and then produce a word-cloud visualisation. This was of interest to me because around the same time I posted an independent Google Scholar scraper function, get_google_scholar_df(), which does a similar job to the scraping part of Kay's function using XPath (whereas he had used regular expressions).
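As a rough illustration of that word-cloud step (this is not Kay's code, and the titles vector is invented), assuming the scraping stage has already returned a character vector of result titles:

    library(wordcloud)

    # Invented titles standing in for scraped Google Scholar results.
    titles <- c("Web scraping with R", "Scraping Google Scholar with XPath",
                "Data collection from the web using the XML package")

    words <- unlist(strsplit(tolower(paste(titles, collapse = " ")), "\\s+"))
    freq  <- sort(table(words), decreasing = TRUE)
    wordcloud(names(freq), as.numeric(freq), min.freq = 1)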

My function worked as follows: given a Google Scholar URL, it extracts as much information as it can from each search result on that page into the columns of a data frame. In the comments of his blog post I figured it would be fun to hack his function to provide an XPath alternative, GScholarXScraper; I think that's pretty much everything I added. Anyway, here's how it works (a link to the full code is at the end of the original post; two images from that post are omitted here). Not bad.

An R function to analyze your Google Scholar Citations page: Google Scholar has now made Google Scholar Citations profiles available to anyone, and you can read about these profiles and set one up for yourself. I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it.

Then they got all crazy on it and wrote a couple of really neat functions. All the cool/interesting components of these functions are their ideas, and any bugs were introduced by me when I fiddled with the code at the end. So how does it work? First source() the script from the URL given in the original post (the address was lost from this excerpt); this will install the following packages if you don't have them: wordcloud, tm, sendmailR, RColorBrewer. You can then call the googleCite function on your Citations profile URL, or search by name:

    source("...")                      # script URL elided in the excerpt
    out = googleCite("...")            # your Google Scholar Citations profile URL (elided)
    out = searchCite("Rafa Irizarry", pdfname = "rafa_wordcloud.pdf")   # or search by name
    gcSummary(out)

Enjoy!

Web Scraping Google URLs: Google slightly changed the HTML code it uses for hyperlinks on search pages last Thursday, causing one of my scripts to stop working. Thankfully, this is easily solved in R thanks to the XML package and the power and simplicity of XPath expressions (the fixed script itself did not survive in this excerpt; a hedged sketch follows below). Lovely jubbly! P.S. I know that there is an API of some sort for Google search, but I don't think anyone has made an R package for it.
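For the Google-search-links excerpt above, the author's fixed script was not preserved, so here is a minimal, hedged sketch of the general XML-plus-XPath approach; the search URL pattern and the XPath expression reflect how Google's result pages looked around that time and are illustrative only, since that markup changes frequently.

    library(RCurl)
    library(XML)

    # Pull the hyperlinks off a Google search results page (illustrative XPath only).
    google_result_urls <- function(query) {
      u    <- paste("http://www.google.com/search?q=",
                    URLencode(query, reserved = TRUE), sep = "")
      html <- getURL(u)
      doc  <- htmlParse(html)
      urls <- xpathSApply(doc, "//h3[@class='r']/a", xmlGetAttr, "href")
      free(doc)
      urls
    }

    # google_result_urls("web scraping with R")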

Yet. (I feel my skill set is insufficient to do it myself!)

Google+ via XPath: Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn't bother me too much, but I've been hearing a lot about Google+ lately, so I figured it might be fun to set up an XPath scraper to extract information from each post of a status-update page. I was originally going to do one for Facebook, but this just seemed more interesting, so maybe I'll leave that for next week if I get time.

Anyway, here's how it works (the full code is linked at the end of the original post): you simply supply the function with a Google+ post-page URL and it scrapes whatever information it can off each post on the page. The page also contains a "load more posts" button, rendered as

    <span role="button" title="Load more posts" tabindex="0" style="">More</span>

but how one would use that is beyond me.

Yahoo Search Page via XPath:

    library(RCurl)
    library(XML)

    get_yahoo_search_df <- function(u) {

      # Apply FUN to the given XPath within each result node under path.base,
      # optionally post-processing with FUN2, and padding missing nodes with NA.
      xpathSNullApply <- function(doc, path.base, path, FUN, FUN2 = NULL) {
        nodes.len <- length(xpathSApply(doc, path.base))
        paths <- sapply(1:nodes.len, function(i) gsub(path.base,
                        paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
        xx <- lapply(paths, function(xpath) xpathSApply(doc, xpath, FUN))
        if (!is.null(FUN2)) xx <- FUN2(xx)   # reconstructed: this branch was truncated in the excerpt
        xx[sapply(xx, length) < 1] <- NA
        xx <- as.vector(unlist(xx))
        return(xx)
      }

      html <- getURL(u, followlocation = TRUE)
      doc  <- htmlParse(html)

      path.base <- "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li"

      df <- data.frame(
        title = xpathSNullApply(doc, path.base,
                  "/html/body/div[@id='doc']/div[@id='bd-wrap']/div[@id='bd']/div[@id='results']/div[@id='cols']/div[@id='left']/div[@id='main']/div[@id='web']/ol/li/div/div/h3/a",
                  xmlValue),
        stringsAsFactors = FALSE)

      free(doc)
      return(df)
    }

    df <- get_yahoo_search_df(u)   # u: a Yahoo search results URL (elided in the excerpt)
    t(df[1:5, ])
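The value of u did not survive in the excerpt; purely as a hypothetical illustration, it would be an ordinary Yahoo results address:

    # Hypothetical example URL; the original post used its own search address.
    u <- "http://search.yahoo.com/search?p=web+scraping+with+R"

As with the other scrapers collected here, Yahoo's page markup has changed since the post was written, so the long XPath above may need updating before it matches anything.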