Datacleaning

The R Project for Statistical Computing.

Reshape. R provides a variety of methods for reshaping data prior to analysis.

Transpose: Use the t() function to transpose a matrix or a data frame. In the latter case, row names become variable (column) names.

# example using built-in dataset mtcars
t(mtcars)

The Reshape Package: Hadley Wickham has created a comprehensive package called reshape to massage data.

Basically, you "melt" data so that each row is a unique id-variable combination. You then "cast" the melted data into the shape you want.

# example of melt function
# (mydata is assumed to contain the columns id and time)
library(reshape)
mdata <- melt(mydata, id=c("id","time"))

# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)

There is much more that you can do with the melt( ) and cast( ) functions.

Reshaping data in R « Duncan Golicher’s weblog. The text below lacks graphics and tables; however, clicking on the blue first paragraph downloads it all as a PDF. Duncan Golicher. One of the most frustrating and time-consuming parts of statistical analysis is shuffling data into a format for analysis. No one enjoys changing data formats. Researchers want to get results, finish the task, and move on. Routine reformatting of data is made difficult by complications that could have been avoided.
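To make the melt and cast steps from the reshape example above concrete, here is a minimal sketch that uses only base R rather than the reshape package itself. The data frame mydata and its values are illustrative assumptions, and rep(), unlist() and tapply() stand in for melt() and cast(), so this shows the idea rather than what the package does internally.

```r
# Hypothetical wide data: two subjects (id) measured at two times on x1 and x2
mydata <- data.frame(
  id   = c(1, 1, 2, 2),
  time = c(1, 2, 1, 2),
  x1   = c(5, 3, 6, 2),
  x2   = c(6, 5, 1, 4)
)

# "Melt" by hand: stack the measured columns so that each row holds
# one id-time-variable combination and its value
measured <- c("x1", "x2")
mdata <- data.frame(
  id       = rep(mydata$id, times = length(measured)),
  time     = rep(mydata$time, times = length(measured)),
  variable = rep(measured, each = nrow(mydata)),
  value    = unlist(mydata[measured], use.names = FALSE)
)

# "Cast" by hand: average the values per id (or per time) and variable;
# tapply() returns a matrix with one column per variable
subjmeans <- tapply(mdata$value, list(mdata$id, mdata$variable), mean)
timemeans <- tapply(mdata$value, list(mdata$time, mdata$variable), mean)

print(mdata)
print(subjmeans)
print(timemeans)
```

Once the data are in the long mdata layout, grouping by a different key only means changing the list passed to tapply().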

Students who are not instructed in data management, or those who ignore instruction, use spreadsheets to hold data. Students should be taught from the start to collect and hold data in a standardized “long” form, in which there is only one variable for each form of measurement. There are many tricks for handling data in R using tapply, lapply, stack, aggregate, by and a range of other functions.

Visualizing data in long format using R. Or the boxplots can be placed the other way around.

Reshaping data.

Transformation.pdf (application/pdf Object)

Exploratory Data Analysis With SPSS. When many independent random factors act in an additive manner to create variability, the dataset follows a bell-shaped distribution called the normal (or Gaussian) distribution, named after Carl Friedrich Gauss (1777-1855). The normal distribution has some special mathematical properties which form the basis of many statistical tests.
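The claim that many additive random factors produce a bell-shaped distribution can be checked with a small simulation. The sketch below is only an illustration: the seed, the number of observations, the number of factors and the choice of uniform factors are all arbitrary assumptions, not values from the source.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n_obs     <- 10000  # number of simulated observations (arbitrary)
n_factors <- 50     # independent additive factors per observation (arbitrary)

# Each observation is the sum of many independent uniform(0, 1) factors
factors <- matrix(runif(n_obs * n_factors), nrow = n_obs, ncol = n_factors)
totals  <- rowSums(factors)

# For a roughly normal sample, the mean and the median should be close
print(mean(totals))
print(median(totals))
print(sd(totals))

# The histogram should look approximately bell-shaped
hist(totals, breaks = 50, main = "Sum of 50 uniform factors")
```

Each factor has mean 1/2 and variance 1/12, so the sums should centre near 25 with a standard deviation near sqrt(50/12).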

Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian. A normal distribution can be defined by two parameters, the mean and the standard deviation. By definition, normal frequency distributions are continuous and unimodal (not bimodal). Of course, not all datasets follow a normal distribution. One example is the binomial distribution, a probability distribution of binary variables. How to recognize a normal (& non-normal) distribution: In a perfect normal frequency distribution, the mean, median and mode are equal. Parametric & Nonparametric Methods: t-test, ANOVA, many others. Why perform EDA? EDA includes:

Degrees of Freedom.

Degrees of Freedom. Gerard E. Dallal, Ph.D. [Early draft subject to change.] One of the questions an instructor dreads most from a mathematically unsophisticated audience is, "What exactly is degrees of freedom?" It's not that there's no answer. The mathematical answer is a single phrase, "The rank of a quadratic form." The problem is translating that to an audience whose knowledge of mathematics does not extend beyond high school mathematics. As an alternative to "the rank of a quadratic form", I've always enjoyed Jack Good's 1973 article in the American Statistician, "What are Degrees of Freedom?" At the moment, I'm inclined to define degrees of freedom as a way of keeping score.

A single sample: There are n observations.
Two samples: There are n1+n2 observations.
One-way ANOVA with g groups: There are n1+..

The primary null hypothesis being tested by one-way ANOVA is that the g population means are equal. There is another way of viewing the numerator degrees of freedom for the F ratio.

Data Mining, Statistical Analysis, Software and Services, Credit Scoring | StatSoft.

Statistics.com.