
Data Sets


The Journalist-Engineer A couple of months ago, I published an article comparing historic and present-day popularity of older music. I used two huge datasets: 50,000 Billboard songs and 1.4M tracks on Spotify. If I were writing an academic paper, I’d do a ton of analysis, regression, and modeling to figure out why certain songs have become more popular over time. Or I could just make some sick visualizations… Instead of reporting on my “theory”, I wagered that readers would get more out of an elegant presentation of the data, not an analysis of it. Here’s that same approach on another project: rappers and the size of their vocabulary. Instead of proving that one rapper was better than another, readers are really good at absorbing the data, and they’d much rather form their own judgements. A few years ago, Bret Victor wrote about the notion of passive and active readers: In theory, this sounds great…but kinda crazy. But it’s happening: there are active readers. I believe it’s a response to “too long, didn’t read.”

UCI KDD Archive Public Data Sets
A data set containing Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from
Last Modified: Jan 12, 2015 21:46 PM GMT
High resolution climate data to help assess the impacts of climate change primarily on agriculture. These open access datasets of climate projections will help researchers make climate change impact assessments.
Last Modified: Dec 8, 2014 18:49 PM GMT
A corpus of web crawl data composed of over 5 billion web pages.
Last Modified: Mar 17, 2014 17:51 PM GMT
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth.
Last Modified: Nov 12, 2013 13:27 PM GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available.
Last Modified: Oct 8, 2013 14:38 PM GMT

Artificial Intelligence Million Song Dataset | scaling MIR research The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field. The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The Million Song Dataset is also a cluster of complementary datasets contributed by the community: The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA. How to get started: To get a sense of the dataset, you can look at this description of one of the million songs. To start your own experiments, you can download the entire dataset (280 GB). We also have a set of suggested tasks, including snippets of code to get you started.

What is the Marital Status of Americans by Age? Visualization Data Notes A few months ago I created a visualization that allowed users to compare age distributions for various topics and another one that showed marital status by age range. Marital Status Sex Pretty generic question here. Race The ACS has six basic race categories. Employment Status This field is broken out to let you see not only who is in the labor force and who isn’t, but also the age of those who are employed in the Armed Forces. State Geography is often associated with different trends.

GIS Books and Tutorials A small problem facing Spanish-speaking users of Geographic Information Systems is not an abundance of information available in Spanish but rather the opposite, especially for anyone reluctant to explore documentation in other languages (English in particular). So here we will try to share books, tutorials, manuals, and articles in PDF, and best of all in the language we like so much; documents will be updated or added as they become available. Manual Básico de ArcGIS 10: Contains 148 pages divided into ten chapters, with content rich in theoretical foundations and images. Author: Ronald Puerta, et al. Sistemas de Información Geográfica: This book covers most of the theoretical aspects of GIS, treating concepts and methodologies independently of the software you work with. If the links are down, or you want to make a suggestion or share material, do not hesitate to leave a comment.

CS 229: Machine Learning (Course handouts)
Lecture notes 1 (ps) (pdf) Supervised Learning, Discriminative Algorithms
Lecture notes 2 (ps) (pdf) Generative Algorithms
Lecture notes 3 (ps) (pdf) Support Vector Machines
Lecture notes 4 (ps) (pdf) Learning Theory
Lecture notes 5 (ps) (pdf) Regularization and Model Selection
Lecture notes 6 (ps) (pdf) Online Learning and the Perceptron Algorithm. (optional reading)
Lecture notes 7a (ps) (pdf) Unsupervised Learning, k-means clustering.
Lecture notes 7b (ps) (pdf) Mixture of Gaussians
Lecture notes 8 (ps) (pdf) The EM Algorithm
Lecture notes 9 (ps) (pdf) Factor Analysis
Lecture notes 10 (ps) (pdf) Principal Components Analysis
Lecture notes 11 (ps) (pdf) Independent Components Analysis
Lecture notes 12 (ps) (pdf) Reinforcement Learning and Control
Supplemental notes 1 (pdf) Binary classification with +/-1 labels.
Supplemental notes 2 (pdf) Boosting algorithms and weak learning.

Alternative Interfaces How fast does Miles Teller play in Whiplash? EDIT 05 Sep. 2015: The concept of Beats Per Minute (BPM) was misused here, as pointed out on reddit. What I should have written is Strokes Per Minute (SPM). Released in 2014, Whiplash focuses on a promising young drummer (Miles Teller) pursuing his dream of greatness. I am unfortunately not a musician, nor an enlightened enthusiast, so what strikes me the most is Miles Teller's ability to play quite fast. The metric used is Beats Per Minute (BPM) which, in the case of the drums, simplifies to how many times the drummer hits his instrument per minute. Now let's see how Miles performs in the first scene of the movie. The BPM of the final scene has also been studied (from 2:36 to 3:56 of the embedded video). Truly fast in my opinion. And finally, let's look at the BPM of the challenge given by the World's Fastest Drummer to Miles Teller. But as Miles said, “Why would you challenge a guy who played in some garage bands in Florida and has a fun time doing it?”
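The strokes-per-minute figure described above is simple arithmetic once the individual hits have been timestamped. Here is a minimal sketch in Python, assuming the hit times are already available as a list of seconds; the function name `strokes_per_minute` and the sample timestamps are hypothetical and not taken from the original analysis, which does not say how the hits were detected.

```python
def strokes_per_minute(hit_times_s):
    """Average strokes per minute for a list of hit timestamps in seconds.

    Hypothetical helper for illustration: N hits span N - 1 intervals,
    so the rate is (N - 1) divided by the elapsed time, scaled to a minute.
    """
    if len(hit_times_s) < 2:
        raise ValueError("need at least two hits to measure a rate")
    times = sorted(hit_times_s)
    duration_s = times[-1] - times[0]
    if duration_s <= 0:
        raise ValueError("hits must span a positive duration")
    return 60.0 * (len(times) - 1) / duration_s


# Made-up example: five evenly spaced hits over one second.
example_hits = [0.0, 0.25, 0.5, 0.75, 1.0]
example_spm = strokes_per_minute(example_hits)
print(example_spm)  # 240.0
```

Counting intervals rather than hits avoids overstating the rate on short clips, where the difference between N and N - 1 is noticeable.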

Data from: Three keys to the radiation of angiosperms into freezing environments - Dryad When using this data, please cite the original publication: Zanne AE, Tank DC, Cornwell WK, Eastman JM, Smith SA, FitzJohn RG, McGlinn DJ, O'Meara BC, Moles AT, Reich PB, Royer DL, Soltis DE, Stevens PF, Westoby M, Wright IJ, Aarssen L, Bertin RI, Calaminus A, Govaerts R, Hemmings F, Leishman MR, Oleksyn J, Soltis PS, Swenson NG, Warman L, Beaulieu JM, Ordonez A (2014) Three keys to the radiation of angiosperms into freezing environments. Nature 506(7486): 89–92. Additionally, please cite the Dryad data package: Zanne AE, Tank DC, Cornwell WK, Eastman JM, Smith SA, FitzJohn RG, McGlinn DJ, O'Meara BC, Moles AT, Reich PB, Royer DL, Soltis DE, Stevens PF, Westoby M, Wright IJ, Aarssen L, Bertin RI, Calaminus A, Govaerts R, Hemmings F, Leishman MR, Oleksyn J, Soltis PS, Swenson NG, Warman L, Beaulieu JM, Ordonez A (2013) Data from: Three keys to the radiation of angiosperms into freezing environments.

I'm going to give access to the datasets and data tools groups to keep everything tidy ;) by dishwasherz Jul 27

A large collection of very varied datasets. One to keep on hand! by simd3v Jul 27
