Simply Statistics

Related: Data science blogs

Linguistics and Data Science

Data Mining Course - Data Science, Big Data Analytics
Course content and objectives: Data Mining - Data Science.
Data mining, shorthand for "Knowledge Discovery in Databases" (KDD; "Extraction de Connaissances à partir de Données" in French), is a very fashionable field. Reading the various documents that try, with varying success, to define exactly what data mining is, one might conclude that we have in fact been practicing it for more than 30 years under the names of data analysis and exploratory statistics. And that would not be entirely wrong. In reality it is not that simple: data mining brings several new elements that are far from negligible: (1) analysis techniques from outside the statistician's culture, drawn from machine learning (artificial intelligence), pattern recognition, and databases; (2) knowledge extraction that is integrated into the organizational structure of the company.
Target audience

Handling Large Datasets In R
Handling large datasets in R, especially CSV data, was briefly discussed before in Excellent Free CSV Splitter and Handling Large CSV Files in R. My file at that time was around 2 GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18 GB with 40 million rows, even after removing biased data as described in Biases in TRACE Corporate Bond Data. How to proceed efficiently?
Incidentally, determining the number of rows in a very big file is tricky. You don't have to load the data first and call dim(), which easily runs out of memory; reading the file in chunks and counting the lines is enough:
    data <- gzfile("yourdata.zip", open = "r")   # open the file for reading
    MaxRows <- 50000                             # lines to read per chunk
    TotalRows <- 0
    while ((LeftRow <- length(readLines(data, MaxRows))) > 0)
      TotalRows <- TotalRows + LeftRow           # add this chunk's line count
    close(data)
Tags: data, csv
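The chunked row count described above can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name count_rows and its chunk_size argument are hypothetical, not from the original post, and the demonstration uses a small temporary CSV rather than the 6.18 GB file. Note that gzfile() reads gzip-compressed and uncompressed files; a real .zip archive would normally be opened with unz() instead.

```r
# Count the lines of a (possibly gzip-compressed) text file without loading
# the whole file into memory. count_rows() is a hypothetical helper name.
count_rows <- function(path, chunk_size = 50000L) {
  con <- gzfile(path, open = "r")  # gzfile() also reads uncompressed files
  on.exit(close(con))              # always release the connection
  total <- 0
  repeat {
    n <- length(readLines(con, n = chunk_size))  # read the next chunk
    if (n == 0L) break                           # end of file reached
    total <- total + n
  }
  total
}

# Demonstration on a temporary CSV: one header line plus 1000 data rows.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, price = runif(1000)), tmp, row.names = FALSE)
n_lines <- count_rows(tmp, chunk_size = 300L)
n_data_rows <- n_lines - 1  # subtract the header line
```

Only one chunk of lines is held in memory at a time, so peak memory use depends on chunk_size rather than on the size of the file.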

Course Materials - Data Mining and Data Science
This page lists the materials used for my courses in Machine Learning, Data Mining and Data Science in the Département Informatique et Statistique (DIS, the Department of Computer Science and Statistics) at Université Lyon 2, mainly in the second-year Master's program Statistique et Informatique pour la Science des donnéEs (SISE), a data science program focused on the statistical processing of data and on extracting value from big data. I pay close attention to the strong synergy between computer science and statistics in this degree; these are the essential pillars of the data scientist's profession. Note that most of the materials are slides printed to PDF and therefore quite informal; above all they emphasize the main thread of the topic studied and list the key points. This page is of course open to all statisticians, data miners and data scientists, students or not, from Université Lyon 2 or elsewhere. We thank you in advance. Ricco Rakotomalala, Université Lyon 2

New release: Choroplethr v3.2.0 - AriLamstein.com
Today I am happy to announce that a new version of choroplethr, v3.2.0, is now available. You can get it by typing the following from an R console:
    install.packages("choroplethr")
Note that it sometimes takes a few days for new packages to get copied to each CRAN mirror. If install.packages("choroplethr") only gets you version 3.1.0, please try again tomorrow. This version contains three changes.
Change #1: Better Default Projection
The most significant change is the addition of a better default map projection.
    library(choroplethr)
    data(df_pop_county)
    df_pop_county$value = NA
    # map drawn with the new default projection
    new = county_choropleth(df_pop_county, title = "New Default")
    # same map built through the object interface, with the projection removed
    old = CountyChoropleth$new(df_pop_county)
    old$title = "Old Default"
    old$projection = element_blank()
    old = old$render()
    # show both maps side by side
    library(gridExtra)
    grid.arrange(new, old, ncol=2)
Change #2: Better Border Control
On maps with many small regions, the borders can obscure information. In previous versions of choroplethr it was hard to make the right-hand map.
In Other News

The Unofficial Google Data Science Blog

Blog - AriLamstein.com
Today’s guest post is by Julia Silge. After reading Julia’s analysis of religions in America (“This is the Place, Apparently”) I invited her to teach my readers how to map information about US Religious Adherence by County in R. Julia can be found blogging here or on Twitter.
I took Ari’s free email course for getting started with the choroplethr package last year, and I have so enjoyed making choropleth maps and using them to explore demographic data. The Association of Statisticians of American Religious Bodies (ASARB) publishes data on the number of congregations and adherents for religious groups for each county in the United States. The file made available at the Association of Religion Data Archives is an SPSS file, so we’ll need to use the foreign library to access the file.
    Total adherents/adherence rates including all the religious groups
    Evangelical Protestant
    Black Protestant
    Mainline Protestant
    Catholic
    Orthodox
    Other
Now let’s make a map!
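For the SPSS file mentioned above, the loading step might look roughly like the following. This is a minimal sketch, not the code from the guest post: the function name read_religion_data and the file name religion_census.sav are hypothetical, since the excerpt does not give the actual path. It relies on read.spss() from the foreign package, which is shipped with standard R distributions.

```r
# Minimal sketch: load an SPSS (.sav) file into a data frame with the
# foreign package. Function name and file name are hypothetical.
read_religion_data <- function(path) {
  if (!file.exists(path)) {
    return(NULL)  # nothing to read; let the caller decide what to do
  }
  # to.data.frame = TRUE returns a data frame instead of a list
  foreign::read.spss(path, to.data.frame = TRUE)
}

# Assumed local file name; the post does not state the actual one.
religion <- read_religion_data("religion_census.sav")
if (!is.null(religion)) {
  print(dim(religion))  # number of counties (rows) and variables (columns)
}
```

Returning NULL for a missing file keeps the sketch runnable before the data has been downloaded; a real analysis would more likely stop with an error.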