
R Resources


Thinking inside the box. Anytime 0.3.0. A new version of the anytime package is now on CRAN.

Thinking inside the box

It marks the eleventh release since the inaugural version late last summer. anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, ... format to either POSIXct or Date objects -- and to do so without requiring a format string. See the anytime page, or the GitHub for a few examples. This release brings a little more consistency to how numeric or integer arguments are handled.
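The idea of converting values to dates without a caller-supplied format string can be sketched in a few lines. This is only an illustration of the concept in Python, not how the anytime package is implemented: the function name `to_datetime`, the candidate format list, and the choice to read integers as digit strings such as 20170625 are all assumptions made for this sketch.

```python
from datetime import date, datetime

# Candidate formats tried in order (an assumed, deliberately short list).
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%Y/%m/%d",
    "%Y%m%d",
)


def to_datetime(value):
    """Convert a string, integer, date or datetime to a datetime.

    The caller never passes a format string; each candidate is tried in turn.
    """
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    text = str(value).strip()  # integers are read as digit strings (assumption)
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"no candidate format matched {value!r}")
```

For instance, `to_datetime("2017-06-25")` and `to_datetime(20170625)` resolve to the same day, while input that matches none of the candidates raises a `ValueError` instead of guessing.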

Courtesy of CRANberries, there is a comparison to the previous release. For questions or comments use the issue tracker off the GitHub repo. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Weather forecast with regression models – part 1. In this tutorial we are going to analyse a weather dataset to produce exploratory analysis and forecast reports based on regression models.

Weather forecast with regression models – part 1

We are going to take advantage of a public dataset which is part of the exercise datasets of the “Data Mining and Business Analytics with R” book (Wiley) written by Johannes Ledolter. In doing that, we are taking advantage of several R packages, the caret package included. The tutorial is going to be split into the following four parts: 1. Introducing the weather dataset and outlining its exploratory analysis 2.

R Packages

Overall, we are going to take advantage of the following packages:

Exploratory Analysis

The dataset under analysis is publicly available at: jledolter – datamining – data exercises. It contains daily observations from a single weather station. Metrics at a specific Date and Location are given together with the RainTomorrow variable, which indicates whether rain occurred on the next day.

The description of the variables is the following. Complete Subset Regressions, simple and powerful. By Gabriel Vasconcelos. Complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013.

Complete Subset Regressions, simple and powerful

It is a very simple but powerful technique. U.S. Residential Energy Use: Machine Learning on the RECS Dataset. Contributed by Thomas Kassel.

U.S. Residential Energy Use: Machine Learning on the RECS Dataset

He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January-May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course. Introduction The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.

The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. This project applied machine learning methods to the most recently available RECS dataset, published in 2009. Conclusions. A tidy model pipeline with twidlr and broom. (This article was first published on blogR, and kindly contributed to R-bloggers) @drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

A tidy model pipeline with twidlr and broom

The problem

Different model functions take different types of inputs (data.frames, matrices, etc.) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code. However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages.

Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

Storytelling with data. Oakland Real Estate – Full EDA.

Living in the Bay Area has led me to think more and more about real estate (and how amazingly expensive it is here…) I’ve signed up for trackers on Zillow and Redfin, but the data analyst in me always wants to dive deeper, to look back historically, to quantify, to visualize the trends, etc… With that in mind, here is my first view at Oakland real estate prices over the past decade.

Oakland Real Estate – Full EDA

I’ll only be looking at multi-tenant units (duplexes, triplexes, etc.). The first plot is simply looking at the number of sales each month: You can clearly see a strong uptick in the number of units sold from 2003 to 2005 and the following steep decline in sales bottoming out during the financial crisis in 2008. Interestingly, sales pick up again very quickly in 2009 and 2010 (a time when I expected to see low sales figures) before stabilizing at the current rate of ~30 properties sold per month. Data Science for Business – Time Series Forecasting Part 1: EDA & Data Preparation. Data Science is a fairly broad term and encompasses a wide range of techniques from data visualization to statistics and machine learning models.

Data Science for Business – Time Series Forecasting Part 1: EDA & Data Preparation

But the techniques are only tools in a – sometimes very messy – toolbox. And while it is important to know and understand these tools, here, I want to go at it from a different angle: What is the task at hand that data science tools can help tackle, and what question do we want to have answered? A straightforward business problem is to estimate future sales and future income. Create smooth animations in R with the tweenr package. There are several tools available in R for creating animations (movies) from statistical graphics.

Create smooth animations in R with the tweenr package

The animation package by Yihui Xie will create an animated GIF or video file, using a series of R charts you generate as the frames. And the gganimate package by David Robinson is an extension to ggplot2 that will create a movie from charts created using the ggplot2 syntax: in much the same way that you can create multiple panels using faceting, you can instead create an animation with multiple frames. But from a storytelling perspective, such animations can sometimes seem rather disjointed. For example, here's the example (from the gganimate documentation) of creating an animated bubble chart from the gapminder data. Coblis — Color Blindness Simulator. If you are not suffering from a color vision deficiency, it is very hard to imagine what it looks like to be colorblind.

Coblis — Color Blindness Simulator

The Color BLIndness Simulator can close this gap for you. Just play around with it and get a feeling of how it is to have a color vision handicap. As all the calculations are made on your local machine, no images are uploaded to the server. Therefore you can use images as big as you like; there are no restrictions. Be aware that there are some issues with the “Lens feature” on Edge and Internet Explorer. So go ahead, choose an image through the upload functionality or just drag and drop your image in the center of our Color BLIndness Simulator. If there are any issues with the Color BLIndness Simulator, please send a note through the contact page.

As it is not so easy to describe color blindness, it comes in handy that some smart people developed manipulation algorithms to simulate any form of color vision deficiency. The viridis color palettes. Unsupervised Learning and Text Mining of Emotion Terms Using R. Unsupervised learning refers to data science approaches that involve learning without prior knowledge about the classification of sample data.

Unsupervised Learning and Text Mining of Emotion Terms Using R

In Wikipedia, unsupervised learning has been described as “the task of inferring a function to describe hidden structure from ‘unlabeled’ data (a classification or categorization is not included in the observations)”. The overarching objectives of this post were to evaluate and understand the co-occurrence and/or co-expression of emotion words in individual letters, and to see whether there were any differential expression profiles or patterns of emotion words among the 40 annual shareholder letters. Differential expression of emotion words is used here to refer to quantitative differences in emotion word frequency counts among letters, as well as qualitative differences in certain emotion words occurring uniquely in some letters but not present in others. The dataset Analysis of emotions terms usage. NiceOverPlot, or when the number of dimensions does matter.

Hi there! Over the last few months, my lab-mate Irene Villa (see more of her work here!) and I have been discussing ecological niche overlap. The niche concept dates back to ideas first proposed by ornithologist J. Grinnell (1917). Later on, G.E. Timekit: Time Series Forecast Applications Using Data Mining. The timekit package contains a collection of tools for working with time series in R. There are a number of benefits. One of the biggest is the ability to use a time series signature to predict future values (forecast) through data mining techniques. While this post is geared toward exposing the user to the timekit package, there are examples showing the power of data mining a time series as well as how to work with time series in general.
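To make the notion of a time series signature a little more concrete, the sketch below expands each date in an index into a row of calendar features that a regression or other data mining model could use as predictors. It is a Python illustration of the general idea only: the function names and feature names are hypothetical and are not the columns produced by timekit.

```python
from datetime import date


def date_signature(d):
    """Expand one date into calendar features (hypothetical feature names)."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "weekday": d.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "day_of_year": d.timetuple().tm_yday,
    }


def signature_table(dates):
    """Build one feature row per date in the index."""
    return [date_signature(d) for d in dates]


rows = signature_table([date(2017, 6, 25), date(2017, 6, 26)])
```

Each row can then be joined to the observed values, so that forecasting becomes an ordinary supervised learning problem on the derived features.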

A number of timekit functions will be discussed and implemented in the post. The first group of functions works with the time series index, and these include functions tk_index(), tk_get_timeseries_signature(), tk_augment_timeseries_signature() and tk_get_timeseries_summary(). R for Data Science. Happy Git and GitHub for the useR. Test driving Python integration in R, using the ‘reticulate’ package.

Introduction. Not so long ago RStudio released the R package ‘reticulate’, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if they were just native R functions. How to store and use webservice keys and authentication details with R. By Andrie de Vries (@RevoAndrie). I am frequently asked how you can safely store login details and passwords for use by R, without exposing these details in your script.

Yesterday Jennifer Bryan asked this question on Twitter and a small storm of views and tweets erupted. A few minutes later she tweeted that there clearly is no consensus: Different options. Best practices for writing an API package. So you want to write an R client for a web API? This document walks through the key issues involved in writing API wrappers in R. If you’re new to working with web APIs, you may want to start by reading “An introduction to APIs” by Zapier. Overall design. APIs vary widely. Before starting to code, it is important to understand how the API you are working with handles important issues so that you can implement a complete and coherent R client for the API.

The key features of any API are the structure of the requests and the structure of the responses. A request is made up of the following components:

- HTTP verb (GET, POST, DELETE, etc.)
- The base URL for the API
- The URL path or endpoint
- URL query arguments (e.g., ?

An API package needs to be able to generate these components in order to perform the desired API call, which will typically involve some sort of authentication. For example, to request that the GitHub API provides a list of all issues for the httr repo, we send an HTTP request that looks like: First steps. Setting your working directory permanently in R.
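The request components listed above (HTTP verb, base URL, path, and query arguments) can be assembled as in the following minimal sketch. It uses Python's standard library rather than R, sends nothing over the network, and leaves authentication out. The helper `build_request` is hypothetical, and the repository owner `r-lib` in the example path is an assumption, since the excerpt only says "the httr repo".

```python
from urllib.parse import urlencode, urljoin


def build_request(verb, base_url, path, query=None):
    """Assemble the components of an API call into (verb, full URL)."""
    # Normalise the slashes so base URL and path always join cleanly.
    url = urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
    if query:
        url += "?" + urlencode(query)
    return verb.upper(), url


verb, url = build_request(
    "get",
    "https://api.github.com",
    "/repos/r-lib/httr/issues",  # owner "r-lib" is an assumption
    {"state": "open"},
)
```

Keeping URL construction in one small function like this means that each endpoint wrapper in a client package only has to supply its own path and query arguments.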

Fitting a rational function in R using ordinary least-squares regression. Take your data frames to the next level. UK government using R to modernize reporting of official statistics. Like all governments, the UK government is responsible for producing reports of official statistics on an ongoing basis. That process has traditionally been a highly manual one: extract data from government systems, load it into a mainframe statistical analysis tool and run models and forecasts, extract the results to a spreadsheet to prepare data for presentation, and ultimately combine it all in a manual document editing tool to produce the final report.

The process in the UK looks much like this today: Matt Upson, a Data Scientist at the UK Government Digital Service, is looking to modernize this process with a reproducible analytical pipeline. This new process, based on the UK Government's Technology Service Manual for new IT deployments, aims to simplify the process by using R — the open-source programming language for statistical analysis — to automate the data extraction, analysis, and document generation tasks. The one thing you need to master data science. When you ask people what makes a person great – what makes someone an elite performer – they commonly say “talent.” Most people believe that elite performers are born with their talent. Most people believe that top performers come into the world with an innate talent that makes them special.

You see something like this in data science too. People hear about elite data scientists and they assume that these people are just naturally gifted. Selecting columns and renaming are so easy with dplyr. Why I love R Notebooks. Lesser known dplyr tricks – R-bloggers. R Markdown: How to format tables and figures in .docx files. R Markdown: How to number and reference tables. Using knitr and pandoc to create reproducible scientific reports. Analytical and Numerical Solutions to Linear Regression Problems. Maximize manufacturing profit. Optimize! From Descriptive to Prescriptive Analytics. RSQLite: Write a local data frame or file to the database. R and SQLite: Part 1. R: Monitoring the function progress with a progress bar. A wrapper around nested ifelse. Online Text Correction. How to combine multiple CSV files into one using CMD - Markdown Tables generator.

Version Control, File Sharing, and Collaboration Using GitHub and RStudio. The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users. Empirical Software Engineering using R: first draft available for download. Implementation of a basic reproducible data analysis workflow. Principal Component Analysis. Endole: Business Information Company Check. How to really do an analysis in R (part 1, data manipulation) - SHARP SIGHT LABS. Ggedit – interactive ggplot aesthetic and theme editor. Ggedit 0.0.2: a GUI for advanced editing of ggplot2 objects.

The Meeting Point Locator. Two meanings of priors, part I: The plausibility of models. Books I like. R - Change default alignment in pander (pandoc.table). The PValues Data Table. Learning Statistics on Youtube. Products. Solarized - Ethan Schoonover. Rguide. GitHub - caesar0301/awesome-public-datasets: An awesome list of high-quality open datasets in public domains (on-going). qinwf/awesome-R: A curated list of awesome R frameworks, packages and software. Plot some variables against many others with tidyr and ggplot2. Express Intro to dplyr. One function to run them all… Or just eval. Virtual Library of Simulation Experiments: Test Functions and Datasets. First steps with Non-Linear Regression in R. CRAN Task View: Design of Experiments (DoE) & Analysis of Experimental Data. Python Annotated Heatmaps. 100 “must read” R-bloggers’ posts for 2015.

Google scholar scraping with rvest package. A Complete Tutorial on Time Series Modeling in R. Learning R Using a Chemical Reaction Engineering Book: Part 4. Bringing the powers of SQL into R. Using Python and R together: 3 main approaches. Introduction to bootstrap with applications to mixed-effect models. Mixture of Gaussian Distributions. Demonstration of nls function. LsExamples. Learn R : 12 Books (Free PDFs!) and Online Resources - YOU CANalytics. Linear or Nonlinear Regression? That Is the Question. Wandering through the beautiful world of math, computations and visualizations. Online Derivative Calculator. The Yacas computer algebra system. R tips pages. Curve Fitting with Linear and Nonlinear Regression. Nonlinear Regression. Simple Nonlinear Regression. Using R for Time Series Analysis — Time Series 0.2 documentation. Trevor Stephens — Titanic: Getting Started With R.

Data Analytics for Beginners: Part 1. Weekly road fuel prices - Statistical data sets. Using Linear Regression to Predict Energy Output of a Power Plant. All Datasets. Learning Chemical Engineering. District Data Labs - How to Transition from Excel to R. Rounding numbers in Access.