
Data Wrangling


Intro to 'apply'

At any R Q&A site, you’ll frequently see an exchange like this one: Q: How can I use a loop to [...insert task here...]?


A: Don’t. Use one of the apply functions. So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions. Let’s examine each of those. 1. apply. Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.” OK – we know about vectors/arrays and functions, but what are these “margins”?
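To make the “margins” concrete, here is a minimal sketch with exactly the kind of embarrassingly trivial data the author recommends: MARGIN = 1 means the function is applied across rows, MARGIN = 2 across columns.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column-wise

apply(m, 1, sum)   # margin 1 (rows): returns 9 12
apply(m, 2, sum)   # margin 2 (columns): returns 3 7 11
```

One value comes back per row (or per column), so the result collapses the margin you did not name.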

R tutorial on the Apply family of functions

Introduction: In our previous tutorial, Loops in R: Usage and Alternatives, we discussed one of the most important constructs in programming: the loop.


Eventually we deprecated the usage of loops in R in favor of vectorized functions.

Plyr - sapply vs. lapply vs. apply vs. tapply vs. by vs. aggregate

In R, extract part of object from list

The tidy tools manifesto

Hadley Wickham: This document lays out the consistent principles that unify the packages in the tidyverse.


The goal of these principles is to provide a uniform interface so that tidyverse packages work together naturally, and once you’ve mastered one, you have a head start on mastering the others. This is my first attempt at writing down these principles. That means that this manifesto is both aspirational and likely to change heavily in the future. Currently no packages precisely meet the design goals, and while the underlying ideas are stable, I expect their expression in prose will change substantially as I struggle to make explicit my process and thinking. There are many other excellent packages that are not part of the tidyverse, because they are designed with a different set of underlying principles. There are four basic principles to a tidy API.

Reshaping data

Preparing and reshaping data is the ever-continuing task of a data analyst.


Luckily we have many tools for it. The default tool in R would be reshape(), although it is unfriendly enough that a reshape package has been added too. I try to use reshape() (the function) because I feel it is a good tool, though with a somewhat cryptic manual. The latter may be because it is written in terms of longitudinal data, whereas my experience is converting data from easy to enter in Excel to suitable for analysis in R. To exercise myself a bit more, I have taken all examples from the SAS transpose procedure and implemented them in R. Examples 1 to 3: These examples are so simple that the best tool is the t() function.
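For the simplest transposes, a minimal sketch along these lines works (the student names and scores are invented, not one of the SAS examples themselves):

```r
# Wide layout: one column per student, one row per test
scores <- data.frame(Sue = c(72, 85), Bob = c(90, 64),
                     row.names = c("test1", "test2"))

# t() returns the transpose: students become rows, tests become columns
t(scores)
```

Note that t() returns a matrix, so if the columns mix types everything is coerced to a common type.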

Reshape and aggregate data with the R package reshape2

Creating molten data: Instead of thinking about data in terms of a matrix or a data frame, where we have observations in the rows and variables in the columns, we need to think of the variables as divided into two groups: identifier and measured variables.


Identifier variables (id) identify the unit that measurements take place on. In the data frame below, subject and time are the two id variables, and age, weight and height are the measured variables. We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. All measured variables must be of the same type, e.g., numeric, factor, date.

Intro to reshape2

October 19, 2013. reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
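A minimal sketch of melting such data with reshape2 (the values are toy data; the subject/time id split mirrors the description above):

```r
library(reshape2)

wide <- data.frame(subject = c(1, 2), time = c(1, 1),
                   age = c(30, 25), weight = c(70, 80),
                   height = c(1.75, 1.80))

# melt() keeps the id variables and stacks the measured ones into a
# variable/value pair: one row per observation of one variable
molten <- melt(wide, id.vars = c("subject", "time"))
```

After melting, molten has columns subject, time, variable and value, with one row for each of the six measurements.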


Converting a dataset from wide to long

I recently had to convert a dataset that I was working with from a wide format to a long format for my analysis.


I struggled with this a bit, but finally found the right sources and the right package to do it, so I thought I'd share my practical example of reshaping data in R. This post is specifically helpful for those using Demographic and Health Survey (DHS) data. The DHS dataset includes one observation for each woman.

For each observation, there are 20 columns (one per possible birth) for each of 16 different characteristics. If no birth happened, then the cell is left missing.

Creating a matrix from a long df

Introducing tidyr

tidyr is a new package that makes it easy to “tidy” your data.


Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are: each column is a variable, and each row is an observation. Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

Tidy data

(This is an informal and code-heavy version of the full tidy data paper.
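A minimal sketch of those three verbs on a toy dataset (the column names trt_a and trt_b are invented for illustration):

```r
library(tidyr)

messy <- data.frame(id = 1:2, trt_a = c(5, 7), trt_b = c(6, 8))

# gather(): collect the trt_* columns into key/value pairs (wide -> long)
long <- gather(messy, key, value, trt_a, trt_b)

# separate(): split "trt_a" into two variables at the underscore
tidy <- separate(long, key, into = c("group", "treatment"), sep = "_")

# spread(): the inverse of gather(); one column per treatment again
wide <- spread(tidy, treatment, value)
```

Each step leaves the values untouched and only moves variables between rows and columns.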


Please refer to that for more details.) “Happy families are all alike; every unhappy family is unhappy in its own way” (Leo Tolstoy). Like families, tidy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

Introducing dplyr

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focusing only on data frames. dplyr is faster, has a more consistent API and should be easier to use.


There are three key ideas that underlie dplyr. Your time is important, so Romain Francois has written the key pieces in Rcpp to provide blazing-fast performance; performance will only get better over time, especially once we figure out the best way to make the most of multiple processors. Tabular data is tabular data regardless of where it lives, so you should use the same functions to work with it.

With dplyr, anything you can do to a local data frame you can also do to a remote database table. Let’s compare plyr and dplyr with a little example, using the Batting dataset from the fantastic Lahman package, which makes the complete Lahman baseball database easily accessible from R.

Data Processing with dplyr & tidyr

UI for R — learn dplyr

Introducing Exploratory Desktop — UI for R

dplyr is amazing. I immediately fell in love with it when I encountered it for the first time, because each command interface was simple and beautiful, its use of the ‘pipe’ made the data analysis pipeline readable for anybody, and the functionality it provided was already comprehensive and practical for real use cases, especially when combined with tidyr.
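Rather than reproduce the Batting comparison, here is a hedged sketch of that piped dplyr style on a made-up stand-in for the Batting table (the player IDs and hit counts are invented):

```r
library(dplyr)

# A tiny stand-in for Lahman's Batting table
batting <- data.frame(playerID = c("ruthba01", "ruthba01", "aaronha01"),
                      yearID = c(1927, 1928, 1957),
                      H = c(192, 173, 198))

# Verbs chain naturally with the pipe; the same code would also
# work against a remote database table
batting %>%
  group_by(playerID) %>%
  summarise(total_hits = sum(H)) %>%
  arrange(desc(total_hits))
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline readable top to bottom.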

On top of that, the performance was blazing fast.

Argument variations of select() in dplyr

When I read the dplyr vignette, I found a convenient way to select sequential columns, such as select(data, year:day). Because I had input only column names to the select() function, I was struck by how convenient this was. On closer inspection, I found that the select() function accepts many types of input. Here, I will enumerate the variety of acceptable inputs for the select() function. By the way, these column selection methods can also be used in summarise_each(), mutate_each() and some functions in the tidyr package (e.g. gather()).

Dplyr's mutate_each within function works but matches() does not find argument

Aggregation with dplyr: summarise and summarise_each

For this article we will use the well-known mtcars data frame.
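A minimal sketch of both points on the built-in mtcars data: a few of select()'s documented input variations, and the simplest summarise() call.

```r
library(dplyr)

mt <- tbl_df(mtcars)   # same data, nicer printing

select(mt, mpg:hp)            # a range of consecutive columns
select(mt, starts_with("d"))  # disp and drat: names starting with "d"
select(mt, matches("^c"))     # cyl and carb: regular-expression match
select(mt, -gear, -carb)      # everything except the named columns

# One function applied to one variable
summarise(mt, mean_mpg = mean(mpg))
```

The same helpers (starts_with(), matches(), ranges, negation) carry over to the other verbs mentioned above.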

We will first transform it into a tbl_df object; no change will occur to the standard data.frame object, but a much better print method will be available. Finally, to keep this article tidy and clean, we will select only four variables of interest. Case 1: apply one function to one variable. In this case, summarise() is the simplest candidate.

Cheatsheet for dplyr join functions

Why the cheatsheet? Examples for those of us who don’t speak SQL so good. There are lots of Venn diagrams re: SQL joins on the interwebs, but I wanted R examples. Full documentation for the dplyr package, which is developed by Hadley Wickham and Romain Francois, is on GitHub. The vignette on Two-table verbs covers the joins shown here.

Working with two small data frames, superheroes and publishers:

suppressPackageStartupMessages(library(dplyr))
library(readr)

superheroes <- "
name, alignment, gender, publisher
Magneto, bad, male, Marvel
Storm, good, female, Marvel
Mystique, bad, female, Marvel
Batman, good, male, DC
Joker, bad, male, DC
Catwoman, bad, female, DC
Hellboy, good, male, Dark Horse Comics
"
superheroes <- read_csv(superheroes, trim_ws = TRUE, skip = 1)

publishers <- "
publisher, yr_founded
DC, 1934
Marvel, 1939
Image, 1992
"
publishers <- read_csv(publishers, trim_ws = TRUE, skip = 1)

Sorry, the cheat sheet does not illustrate “multiple match” situations terribly well.

sessionInfo()
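As a short sketch of the joins the cheatsheet covers, applied to the superheroes and publishers data frames above:

```r
library(dplyr)

# inner_join(): only rows whose publisher appears in both tables
inner_join(superheroes, publishers, by = "publisher")

# left_join(): keep every superhero; yr_founded is NA for Dark Horse Comics
left_join(superheroes, publishers, by = "publisher")

# anti_join(): superheroes with no matching publisher (here, Hellboy)
anti_join(superheroes, publishers, by = "publisher")
```

The join column is named explicitly with by =; leaving it out makes dplyr join on all shared column names.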

GitHub - dgrtwo/fuzzyjoin: Join tables together on inexact matching

Attach, transform, mutate and within

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

With vs. within vs. transform

Rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
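The square-it-and-add-100 exercise from the transformations post above can be sketched like this (the variable names are invented):

```r
df <- data.frame(x = 1:3)

df$y <- df$x^2 + 100                        # base R: assign directly
df_t <- transform(df, y2 = x^2 + 100)       # transform(): returns a modified copy
df_w <- within(df, y3 <- x^2 + 100)         # within(): assignments evaluated inside df

library(dplyr)
df_m <- mutate(df, y4 = x^2 + 100)          # the dplyr equivalent
```

All four produce a column holding 101, 104, 109; the differences are in scoping and whether the original data frame is modified.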

R got good at scraping

R may just have become preferable for simple web-scraping jobs with the release of rvest.

Data Table Cheat Sheet

Two of my favorite data.table features

When I started to use the data.table package, I was primarily using it to aggregate. I had read about data.table and its blazing speed compared to the other options from base R or the plyr package, especially with large amounts of data. As an example, I remember calculating averages or percentages while at Saint Paul Public Schools; while the calculations were running I would walk away for 5 minutes to wait for them to finish. When using data.table to do the same calculations, I didn't need to wait 5 minutes to see the calculated values.
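The grouped-aggregation idiom described above can be sketched like this (the school names and scores are invented):

```r
library(data.table)

scores <- data.table(school = c("A", "A", "B", "B"),
                     pct = c(80, 90, 70, 60))

# Compute in j, group with by: one average per school in a single pass
scores[, .(avg_pct = mean(pct)), by = school]   # A: 85, B: 65
```

The whole aggregation happens inside the [i, j, by] call, which is a large part of where data.table's speed on big data comes from.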