
R: working with large data sets


wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf

save.ffdf and load.ffdf: Save and load your big data – quickly and neatly! I’m very indebted to the ff and ffbase packages in R. Without them, I would probably have to use some less savoury stats program for the bigger data analysis projects I do at work.

Since I started using ff and ffbase, I had been saving and loading my ff data frames with ffsave and ffload. The syntax isn’t so bad, but the process they put your computer through is a bit cumbersome: saving and loading take a while, and ffsave (by default) scatters a bunch of randomly named ff files across a temporary directory. For that reason, I was happy to come across a link to a PDF presentation (sorry, I’ve since lost it) summarizing some useful features of ffbase. I learned that instead of ffsave and ffload, you can use save.ffdf and load.ffdf, which have very simple syntax: save.ffdf(ffdfname, dir="/PATH/TO/STORE/FF/FILES") and load.ffdf(dir="/PATH/TO/STORE/FF/FILES"). As simple as that, you load your files, and you’re done!
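A minimal end-to-end sketch of that workflow (the object name, the toy columns, and the path are placeholders, not taken from the original post):

library(ff)
library(ffbase)

# Build a small ffdf for illustration; in practice this would be your big data
mydata <- as.ffdf(data.frame(x = 1:10, y = rnorm(10)))

# Write every ff file backing the ffdf into one tidy directory
save.ffdf(mydata, dir = "/PATH/TO/STORE/FF/FILES")

# Later, possibly in a fresh R session: recreate the ffdf, under its original
# name, from that directory
load.ffdf(dir = "/PATH/TO/STORE/FF/FILES")
dim(mydata)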

Store your big data!! Analyzing birth rates from census data with RevoScaleR. New Features in Revolution R Enterprise 5.0 (including RevoScaleR) to Support Scalable Data Analysis (webinar). www.statistik.uni-dortmund.de/useR-2008/slides/Elff.pdf. The R package colbycol.

Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need. On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled “Stepping up to big data with R and Python,” was an experiment in collective learning as Marck and I guided a lively discussion of strategies for leveraging the “traditional” analytics stack in R and Python to work with big data.

R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500 as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general-purpose programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Most data scientists have had experience with small to medium data.

Instructions for Installing & Using R on Amazon EC2 | randyzwitch.com. If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around the issues of using big data with R, specifically the fact that R stores all of its objects in memory.
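To make the in-memory point concrete, here is a small base-R illustration (the data frame is made up and the quoted size is approximate):

# Everything R works on normally lives in RAM, so object size grows linearly
# with the number of rows.
x <- data.frame(id = 1:1e6, value = rnorm(1e6))
print(object.size(x), units = "MB")   # on the order of 10 MB for a million rows
# Scale that to a few hundred million rows and the object alone outgrows the
# RAM of a typical laptop.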

While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need. Here are two ways to get started.

Use a pre-made AMI: in the great open-source tradition, there are already R Amazon EC2 AMI images available to use. The way I got started was with the pre-built images that Louis Aslett provides on his website.

Build your own image: alternatively, suppose you want to build your own customized image. Set up the Amazon EC2 instance by launching an Ubuntu 12.04.1 LTS 64-bit image, then install base R.

Analyzing Your Data on the AWS Cloud (with R). Guest post by Jonathan Rosenblatt. Disclaimer: this post is not intended to be a comprehensive review, but more of a “getting started guide”. If I did not mention an important tool or package, I apologize, and I invite readers to contribute in the comments.
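(A footnote to the build-your-own-image route above: once base R is running on the instance, the add-on packages used in this post can be installed from within R. A minimal sketch; the package list and mirror are examples, not a prescribed setup.)

# Run inside R on the fresh instance: install the big-data packages discussed
# in this post from CRAN.
install.packages(c("ff", "ffbase", "bigmemory"),
                 repos = "https://cloud.r-project.org")
library(ff)
library(bigmemory)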

Introduction: I recently had the delight of participating in a “Brain Hackathon” organized as part of the OHBM2013 conference. The hackathon was supported by Amazon, so participants were provided with Amazon credit in order to promote analysis using Amazon Web Services (AWS). While imaging genetics is an interesting research topic, and the hackathon was a great idea in itself, it is AWS that I wish to present in this post. Storing your data and analyzing it in the cloud, be it on AWS, Azure, Rackspace or others, is a quantum leap in analysis capabilities. As motivation for analysis in the cloud, consider:

Here is a quick FAQ before going into the setup stages. Q: How does R fit in? A: Very naturally. Q: Isn’t this expensive? Remark:

bit&ff2.1-2_WU_Vienna2010.pdf. ff&bit_UseR!2009.pdf.

Big data for R. Revolution Analytics recently announced their "big data" solution for R. This is great news and a lovely piece of work by the team at Revolution.

However, if you want to replicate their analysis in standard R, you can absolutely do so, and we show you how.

Data preparation. First you need to prepare the rather large data set used in the Revolution white paper. The preparation script shown below makes two passes over all the files, which is not needed: changing it to a single pass is left as an exercise for the reader. Note that the script will take a while to run and will need some 30-odd gigabytes of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.
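As a hedged sketch of the single-pass variant suggested above (the yearly file names 1987.csv through 2008.csv and the chunk size are assumptions, and this is not the original script), one could append each year's CSV directly into a disk-backed ffdf:

library(ff)
library(ffbase)

# Sketch, not the original script: read each yearly airline CSV straight into
# one disk-backed ffdf, in a single pass and without an intermediate
# airlines.csv on disk.
airlines <- NULL
for (year in 1987:2008) {
  airlines <- read.csv.ffdf(x = airlines,
                            file = paste0(year, ".csv"),
                            header = TRUE,
                            next.rows = 500000)
}
dim(airlines)   # dimensions are available without pulling the data into RAM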

Sample analysis. All done now, just like the Revolution paper. I must admit that I do not understand the Revolution regression example, so I have not attempted to replicate it here.
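As a hedged illustration of the kind of summary involved (ArrDelay is a column name from the airline on-time data, and this chunked loop is a common ff idiom rather than necessarily what the paper used), one could compute the overall mean arrival delay like this:

library(ff)
library(ffbase)

# Illustration only: mean arrival delay, computed chunk by chunk so that only
# one block of the ArrDelay column sits in RAM at a time.
total <- 0
n     <- 0
for (i in chunk(airlines$ArrDelay)) {
  delays <- airlines$ArrDelay[i]
  total  <- total + sum(delays, na.rm = TRUE)
  n      <- n + sum(!is.na(delays))
}
total / n   # overall mean arrival delay in minutes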