
Join (SQL)

A programmer writes a JOIN statement to identify the records for joining. If the evaluated predicate is true, the combined record is then produced in the expected format (a record set or a temporary table). Relational databases are often normalized to eliminate duplication of information when objects may have one-to-many relationships. For example, a Department may be associated with many different Employees. Note: In the Employee table above, the employee "John" has not been assigned to any department yet. This is the SQL to create the aforementioned tables. CROSS JOIN returns the Cartesian product of rows from tables in the join. Example of an explicit cross join: SELECT * FROM employee CROSS JOIN department; Example of an implicit cross join: SELECT * FROM employee, department; The cross join does not apply any predicate to filter records from the joined table. In the SQL:2011 standard, cross joins are part of the optional F401, "Extended joined table", package. Sybase supports the syntax:
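The employee and department tables that the excerpt refers to are not reproduced here, so the sketch below uses small hypothetical stand-ins (the rows and column names are assumptions, apart from "John" having no department) to show what a cross join produces. It does not run the SQL itself; it uses base R's merge with by = NULL, which returns the Cartesian product, as a rough analogue of SELECT * FROM employee CROSS JOIN department.

```r
# Hypothetical stand-ins for the employee and department tables.
employee <- data.frame(
  last_name     = c("Rafferty", "Jones", "John"),
  department_id = c(31, 33, NA)          # "John" has no department yet
)
department <- data.frame(
  department_id   = c(31, 33),
  department_name = c("Sales", "Engineering")
)

# by = NULL asks merge() for the Cartesian product: every employee row is
# paired with every department row, and no predicate filters the result.
cross <- merge(employee, department, by = NULL)

nrow(cross)   # 3 employees x 2 departments = 6 rows
names(cross)  # the shared column name gets .x / .y suffixes
```

Because no predicate is applied, the row for "John" appears once per department even though his department_id is missing.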

Merging
Adding Columns
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
# merge two data frames by ID
total <- merge(data_frameA, data_frameB, by="ID")
# merge two data frames by ID and Country
total <- merge(data_frameA, data_frameB, by=c("ID","Country"))
Adding Rows
To join two data frames (datasets) vertically, use the rbind function.
total <- rbind(data_frameA, data_frameB)
If data_frameA has variables that data_frameB does not, then either: delete the extra variables in data_frameA, or create the additional variables in data_frameB and set them to NA (missing) before joining them with rbind( ).
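To make the two operations concrete, here is a minimal runnable sketch. The two data frames and their columns (score, age) are invented for illustration; only merge and rbind come from the text above.

```r
# Hypothetical example data: both frames share the keys ID and Country.
data_frameA <- data.frame(ID = c(1, 2, 3),
                          Country = c("US", "UK", "FR"),
                          score = c(10, 20, 30))
data_frameB <- data.frame(ID = c(2, 3, 4),
                          Country = c("UK", "FR", "DE"),
                          age = c(41, 35, 29))

# Adding columns: inner join on both keys keeps only IDs present in both.
total <- merge(data_frameA, data_frameB, by = c("ID", "Country"))
nrow(total)  # 2 (IDs 2 and 3)

# Adding rows: each frame lacks one variable, so create it as NA first.
data_frameA$age   <- NA
data_frameB$score <- NA
stacked <- rbind(data_frameA, data_frameB)  # columns are matched by name
dim(stacked)  # 6 rows, 4 columns
```

rbind on data frames matches columns by name, so the differing column order of the two frames after adding the NA variables does not matter.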

A quick primer on split-apply-combine problems | R-bloggers I’ve just answered my hundred billionth question on Stack Overflow that goes something like "I want to calculate some statistic for lots of different groups." Although these questions provide a steady stream of easy points, it’s such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to. First off, you need to get your data in the right format. These problems are widely known as split-apply-combine problems after the three steps involved in their solution. First, we split the count column by the spray column. Secondly, we apply the statistic to each element of the list. Finally, (if possible) we recombine the list as a vector. This procedure is such a common thing that there are many functions to speed up the process. sapply and vapply do the last two steps together. We can do even better than that, however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.
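The three steps can be sketched as follows. The excerpt names the count and spray columns but not the dataset or the statistic; this sketch assumes R's built-in InsectSprays data (which has exactly those columns) and uses the mean as the statistic.

```r
# Assumption: the built-in InsectSprays data frame (columns count, spray).
data(InsectSprays)

# 1. Split the count column by the spray column (gives a list of vectors).
count_by_spray <- with(InsectSprays, split(count, spray))

# 2. Apply the statistic (here: the mean) to each element of the list.
means_list <- lapply(count_by_spray, mean)

# 3. Combine the list back into a named vector.
means_vector <- unlist(means_list)

# sapply and vapply do steps 2 and 3 together.
means_sapply <- sapply(count_by_spray, mean)
means_vapply <- vapply(count_by_spray, mean, numeric(1))

# One-function solutions.
means_tapply <- with(InsectSprays, tapply(count, spray, mean))
means_aggregate <- aggregate(count ~ spray, data = InsectSprays, FUN = mean)
```

vapply is the stricter of the two shortcuts: it fails if any group returns something other than a single number, which makes it the safer choice inside functions.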

PLOS ONE: Cooperation between Referees and Authors Increases Peer Review Accuracy Abstract Peer review is fundamentally a cooperative process between scientists in a community who agree to review each other's work in an unbiased fashion. Peer review is the foundation for decisions concerning publication in journals, awarding of grants, and academic promotion. Here we perform a laboratory study of open and closed peer review based on an online game. We show that when reviewer behavior was made public under open review, reviewers were rewarded for refereeing and formed significantly more cooperative interactions (13% increase in cooperation, P = 0.018). Citation: Leek JT, Taub MA, Pineda FJ (2011) Cooperation between Referees and Authors Increases Peer Review Accuracy. Editor: Attila Szolnoki, Hungarian Academy of Sciences, Hungary. Received: August 24, 2011; Accepted: October 5, 2011; Published: November 9, 2011. Copyright: © 2011 Leek et al. Funding: The authors have no support or funding to report.

Quick-R: Built-in Functions Almost everything in R is done through functions. Here I'm only referring to numeric and character functions that are commonly used in creating or recoding variables. Numeric Functions Character Functions Statistical Probability Functions The following table describes functions related to probability distributions. Other Statistical Functions Other useful statistical functions are provided in the following table. Other Useful Functions Note that while the examples on this page apply functions to individual variables, many can be applied to vectors and matrices as well.
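The tables from the original page are not reproduced in this excerpt, so the following is only a small sample of base R functions from each category, chosen for illustration. The input values are made up.

```r
# Numeric functions, applied element-wise to a vector.
x <- c(-2.5, 1.44, 9)
abs_x     <- abs(x)          # absolute values
sqrt_abs  <- sqrt(abs(x))    # square roots
rounded_x <- round(x, 1)     # round to one decimal place

# Character functions.
s <- "Quick-R"
upper_s  <- toupper(s)       # "QUICK-R"
prefix_s <- substr(s, 1, 5)  # "Quick"
length_s <- nchar(s)         # 7

# Probability functions for the normal distribution:
# d = density, p = cumulative probability, q = quantile, r = random draws.
p <- pnorm(1.96)             # P(Z <= 1.96) for a standard normal
q <- qnorm(p)                # inverse of pnorm, recovers 1.96
set.seed(1)
draws <- rnorm(5, mean = 50, sd = 10)

# The same functions work on matrices, element by element.
m <- matrix(c(1, 4, 9, 16), nrow = 2)
sqrt_m <- sqrt(m)
```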

Regular expression syntax Regular expressions are a widely used way of describing patterns for searching text and for checking whether text matches a pattern. Special metacharacters let you specify, for example, that you are looking for a substring at the beginning of the input string, or for a certain number of repetitions of a substring. At first glance regular expressions look a little scary (well, all right, at second glance they look even scarier ;) ). However, you will very quickly come to appreciate their full power. They will save you many hours of unnecessary coding, and in some cases they will even run faster than hand-coded checks. I strongly recommend that you "play" with the demo program TestRExp.dpr supplied in the distribution; it will help you better understand how regular expressions work and debug your own expressions. Let's begin our acquaintance with regular expressions! Simple matching Any character matches itself, unless it is one of the special metacharacters described just below. Escape sequences Examples: foob. Modifiers
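The ideas of simple matching, anchoring at the start of the string, and counted repetition can be tried out without the Delphi demo program. The sketch below uses R's built-in grepl rather than the library the article describes, so syntax details may differ between the two engines; the test strings are made up.

```r
x <- c("foobar", "barfoo", "fooooobar")

# Simple matching: ordinary characters match themselves.
grepl("foobar", x)      # TRUE FALSE FALSE

# Metacharacter ^ : the substring must be at the start of the input.
grepl("^foo", x)        # TRUE FALSE TRUE

# Repetition: "o" repeated at least three times between "f" and "bar".
grepl("fo{3,}bar", x)   # FALSE FALSE TRUE
```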

rfoxfa/Getting_and_Cleaning_Data · GitHub

Web Scraping | R-bloggers rvest: easy web scraping with R rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with: install.packages("rvest") Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples I was offline much of the day Tuesday and completely missed Hadley Wickham’s tweet about the new rvest package: Are you an #rstats user who misses python's beautiful soup? Web Scraping: working with APIs APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Web Scraping: Scaling up Digital Data Collection Web Scraping part2: Digging deeper

Text Analysis blog | Aylien — Text Analysis 101; A Basic Understanding for... Introduction The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Broadly speaking, there are two classes of ML techniques: supervised and unsupervised. Unsupervised ML techniques differ because they do not require a training dataset, and in the case of documents, the categories are not known in advance. What a Classifier does Classifiers make ‘predictions’; that is their job. How a Classifier works As we mentioned, classification is about prediction. In this case, we have two “features”, temperature and rain, to help us predict whether the game will be played or not played. In the table above, each column is called a “feature”, the “Play?” A simple Illustration of Document Classification

MySQL and R | R-bloggers Using MySQL with R is pretty easy, with RMySQL. Here are a few notes to keep me straight on a few things I always get snagged on. Typically, most folks are going to want to analyze data that’s already in a MySQL database. Being a little bass-ackwards, I often want to go the other way. The docs are a bit clearer for RS-DBI, which is the standard R interface to relational databases and of which RMySQL is one implementation. Opening and closing connections The best way to close DB connections, like you would do in a finally clause in Java, is to use on.exit, like this:
con <- dbConnect(MySQL(), user="me", password="nuts2u", dbname="my_db", host="localhost")
on.exit(dbDisconnect(con))
Building queries Using sprintf to build the queries feels a little primitive. Processing query results You can process query results row by row, in blocks or all at once. Retrieving AUTO_INCREMENT IDs A standard newbie question with MySQL is how to retrieve freshly generated primary keys from AUTO_INCREMENT fields.
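One detail worth spelling out: on.exit registers cleanup for the function it is called in, so the connect/disconnect pair belongs inside a function body. The sketch below shows the same pattern without needing a MySQL server by using a plain file connection as a stand-in for the database connection; count_lines is a hypothetical helper, not part of RMySQL.

```r
# Stand-in for the DB pattern: open a resource, register cleanup, use it.
count_lines <- function(path) {
  con <- file(path, open = "r")   # plays the role of dbConnect(...)
  on.exit(close(con))             # plays the role of dbDisconnect(con);
                                  # runs when the function exits, even on error
  length(readLines(con))
}

tmp <- tempfile()
writeLines(c("a", "b", "c"), tmp)
n <- count_lines(tmp)             # 3
unlink(tmp)
```

As with a finally clause, the cleanup runs whether the function returns normally or stops with an error partway through.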

Installing RMySQL under Windows | Arne Hendrik Schulz Update 2015-01-02: I slightly updated this tutorial based on the comments. Update 2014-12-16: This tutorial also works on Windows 8.1! Connecting R with MySQL can be somewhat difficult on Windows. The package RMySQL is not available as a precompiled zip archive. It needs the installed libmysql.dll library to work and must therefore be compiled on your machine. Step 1: Requirements R (surprise! Step 2: Setup and Configuring I assume that you already have R and the MySQL server installed. The next window is really awkward. Editing the System Path is important and tells R where the compiler is located. Step 3: Tell Windows (and R) where the MySQL libraries are RMySQL needs libmysql.dll to compile successfully. Telling R where to find the libraries can be done in three different ways: command line, environment file (.Renviron) or environment variables. Your environment variables are located under Control Panel → User Accounts → Change my environment variables.

Do more with dates and times in R with lubridate 1.1.0 This is a guest post by Garrett Grolemund (mentored by Hadley Wickham) Lubridate is an R package that makes it easier to work with dates and times. The newest release of lubridate (v 1.1.0) comes with even more tools and some significant changes over past versions. Below is a concise tour of some of the things lubridate can do for you. At the end of this post, I list some of the differences between lubridate (v 0.2.4) and lubridate (v 1.1.0). Lubridate was created by Garrett Grolemund and Hadley Wickham. Parsing dates and times Getting R to agree that your data contains the dates and times you think it does can be a bit tricky. Parsing functions automatically handle a wide variety of formats and separators, which simplifies the parsing process. If your date includes time information, add h, m, and/or s to the name of the function. ymd_hms() is probably the most common date time format. Setting and Extracting information Time Zones His call would arrive at 2:00 am my time! Time Intervals

Do more with dates and times in R with lubridate 1.3.0 Note: This vignette is an updated version of the blog post first published at r-statistics. Lubridate is an R package that makes it easier to work with dates and times. Below is a concise tour of some of the things lubridate can do for you. Lubridate was created by Garrett Grolemund and Hadley Wickham. Parsing dates and times Getting R to agree that your data contains the dates and times you think it does can be tricky.
library(lubridate)
ymd("20110604")
mdy("06-04-2011")
dmy("04/06/2011")
Lubridate's parse functions handle a wide variety of formats and separators, which simplifies the parsing process. If your date includes time information, add h, m, and/or s to the name of the function. ymd_hms is probably the most common date time format.
arrive <- ymd_hms("2011-06-04 12:00:00", tz = "Pacific/Auckland")
arrive
leave <- ymd_hms("2011-08-10 14:00:00", tz = "Pacific/Auckland")
leave
Setting and Extracting information
second(arrive)
second(arrive) <- 25
arrive
second(arrive) <- 0
wday(arrive)
Time Zones

Week 4 Contents References: Dalgaard 2008; Wickham, H. 2009, ggplot2: Elegant graphics for data analysis; Paradis, E. 2005, R for Beginners [PDF]; R Graph Gallery; Tufte, E. 2001, The Visual Display of Quantitative Information. To install today: install.packages("RColorBrewer") Base Graphics As Wickham points out in his book on his R graphics package, the R base graphics system has a pen on paper design. Plotting Basics The most basic command you can use to produce a plot is plot(). Plot Types There are 9 basic ways for R to plot a set of points. p: points (this is the default); l: a line; b: "both", points connected by line segments; c: just the connecting segments of "b"; o: "overplotted", points and lines overplotted; h: a histogram; s: stair; S: alternative stairs; n: none. Here's what the output looks like: Point Types There are 25 different plotting points that can be used by defining the argument pch (which I believe stands for "point character"). Line Types Weights and Sizes Colors Color Brewer Building up Plots par()
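Since the output figure is not reproduced here, the nine plot types can be regenerated with a short loop. The x and y values are arbitrary example data; the type codes are the ones listed above.

```r
# The nine values accepted by the type argument of plot().
types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")

# Arbitrary example data.
x <- 1:10
y <- c(2, 4, 3, 7, 5, 8, 6, 9, 7, 10)

# Draw one panel per type in a 3 x 3 grid, then restore the old settings.
old_par <- par(mfrow = c(3, 3))
for (t in types) {
  plot(x, y, type = t, main = paste0('type = "', t, '"'))
}
par(old_par)
```

Type "n" draws the axes and labels but no data, which is useful as an empty canvas when building up a plot piece by piece with points(), lines() and similar functions.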