

Data Science Wars: Python vs. R As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python. R is Too Complex The most frequently stated argument I’ve heard is the view that Python is general purpose and comparatively easy to learn whereas R remains a somewhat complex programming environment to master. When I first learned R, I did not find it particularly complex; it was a lot easier for me to learn R than C++ or Java with their mammoth frameworks. R Isn’t Really a Language Another argument says that part of the reason people struggle to learn R is that it’s not really a language. Python is More Approachable Some feel that Python is more approachable. Remember, R is a very old statistical environment that has an incredible global following.

Detecting multicollinearity using variance inflation factors | STAT 501 - Regression Methods Okay, now that we know the effects that multicollinearity can have on our regression analyses and subsequent conclusions, how do we tell when it exists? That is, how can we tell if multicollinearity is present in our data? Some of the common methods used for detecting multicollinearity include: The analysis exhibits the signs of multicollinearity, such as estimates of the coefficients that vary from model to model. Looking at correlations only among pairs of predictors, however, is limiting. What is a variance inflation factor? As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is inflated. Let's be a little more concrete. It can be shown that the variance of the estimated coefficient bk is: Note that we add the subscript "min" in order to denote that it is the smallest the variance can be. Let's consider such a model with correlated predictors: How much larger? An example the matrix plot of BP, Dur, Pulse, and Stress:
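To make the idea of a variance inflation factor concrete, here is a minimal sketch in Python. It assumes the standard definition VIF_k = 1 / (1 - R_k^2), where R_k^2 is the R-squared from regressing predictor k on the other predictors. With only two predictors, R_k^2 is the squared Pearson correlation between them, which is the case the sketch handles. The function names and data values are invented for illustration and do not come from the STAT 501 page.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for either predictor in a model with exactly two predictors.

    With two predictors, R_k^2 equals the squared correlation between
    them, so both predictors share the same VIF.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data (not from the source): two moderately correlated predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
vif_correlated = vif_two_predictors(x1, x2)

# Hypothetical data: two uncorrelated predictors, so nothing is inflated.
u1 = [1, 2, 3, 4]
u2 = [1, -1, -1, 1]
vif_uncorrelated = vif_two_predictors(u1, u2)

print(vif_correlated, vif_uncorrelated)
```

A VIF of 1 means the predictor is uncorrelated with the others and its coefficient variance is not inflated. Larger values mean the variance is inflated by that factor relative to the uncorrelated case. For more than two predictors, R_k^2 would have to come from a full regression of predictor k on the rest, which this sketch does not attempt.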

Crime data exploration in R using ggplot2 - Active Analytics Introduction The purpose of this blog post is to outline some exploratory plots using crime data, available from website and the ggplot2 package in R. The ggplot2 package is a plotting and graphics package written for R by Hadley Wickham. Its great-looking plots and impressive flexibility have made it popular among R coders. Though this blog post was written for crime data, the principles can be extended to the analysis of many different data sets. Before I begin there are two items to cover: 1. 2. The Data The data used in this plotting tutorial was from the website.
    # We load some packages
    # Our plotting tool
    require(ggplot2)
    # For arranging the plots
    require(gridExtra)
    # For manipulating the plot scales
    require(scales)
    # For generating our svg files
    require(grDevices)
    options("stringsAsFactors" = TRUE)
    # Path to the folder holding the data csv
    path <- "C:\\
    btpData <- read.csv(file = paste(path, "BTP-Dec-2012.csv", sep = ""), header = TRUE)
The dimensions of the table ...

Top 100 R packages for 2013 (Jan-May)! What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio's recent release of their “0-cloud” CRAN log files (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for the months of January through May)! Relying on the nice code that Felix Schonbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time. Top 8 most downloaded R packages – downloads over time Let’s first have a look at the number of downloads per day for these 5 months, for the top 8 most downloaded packages: We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days. “Family tree” of the top 100 most downloaded R packages Such analysis can (and should!)

swirl - Instructors swirl is a platform for teaching R programming and data science. However, an educational platform is only as good as the content it delivers to students. Although we have contributed some content ourselves, swirl is designed in such a way that you can create your own interactive content and share it freely with students in your classroom or around the world. The swirlify R package provides a comprehensive toolbox for swirl instructors. Step 1: Get R In order to run swirl and swirlify, you must have R 3.0.2 or later installed on your computer. If you need to install R, you can do so here. For help installing R, check out one of the following videos (courtesy of Roger Peng at Johns Hopkins Biostatistics): Step 2 (recommended): Get RStudio In addition to R, it’s highly recommended that you install RStudio, which will make your experience with R much more enjoyable. If you need to install RStudio, you can do so here. Step 3: Install swirl and swirlify Step 4: Start swirlify To create a new lesson:

R-GIS-tutorial/ at master · Pakillo/R-GIS-tutorial