
Datasets for scientific research


UK Data sets. Graphs. Please contact Christian Sommer for comments and questions, or if you have other data sets. Last update: April 2010. Used for shortest path queries; DIMACS means 9th DIMACS Implementation Challenge - Shortest Paths.

DBLP graph: The DBLP Computer Science Bibliography, co-author graph, largest connected component.
Web graph: WebGraph by the Laboratory for Web Algorithmics, link graph interpreted as undirected graph (in which case it is already connected).
Router topology: CAIDA's Router-Level Topology Measurements ("The [...] data file holds link directions corresponding to the traceroute directions."), second file (itdk0304_rlinks_undirected), interpreted as undirected graph, largest connected component.
Citation graph: KDD competition, citation graph of the hep-th portion of the arXiv, hep-th citations tarball, interpreted as undirected graph, largest connected component.
Database of Interacting Proteins. BioGRID. DIMACS format copied from DIMACS.

Stanford Large Network Dataset Collection. Social networks, Networks with ground-truth communities, Communication networks, Citation networks, Collaboration networks, Web graphs, Product co-purchasing networks, Internet peer-to-peer networks, Road networks, Autonomous systems graphs, Signed networks, Location-based online social networks, Wikipedia networks, articles, and metadata, Temporal networks, User Actions, Memetracker and Twitter, Online Communities, Online Reviews, Face-to-Face Communication Networks, Graph classification datasets.

Network types:
Directed: directed network
Undirected: undirected network
Bipartite: bipartite network
Multigraph: network has multiple edges between a pair of nodes
Temporal: for each node/edge we know the time when it appeared in the network
Labeled: network contains labels (weights, attributes) on nodes and/or edges

Network statistics. Citing SNAP: We encourage you to cite our datasets if you have used them in your work.
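Several of the graphs listed above are described as "interpreted as undirected graph, largest connected component". A minimal sketch of that preprocessing step follows, assuming the graph is already available as a list of (u, v) node pairs. The function names and the edge-list representation are illustrative assumptions, not part of any of the listed data sets or their tools.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component.

    `edges` is an iterable of (u, v) pairs. Direction is ignored,
    i.e. the graph is interpreted as undirected.
    """
    # Build an undirected adjacency structure: add each edge both ways.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def restrict_to(edges, nodes):
    """Keep only the edges whose endpoints both lie in `nodes`."""
    return [(u, v) for u, v in edges if u in nodes and v in nodes]
```

Calling `restrict_to(edges, largest_connected_component(edges))` gives the edge list of the largest component. For graphs with many millions of edges, a dedicated graph library or a union-find structure would likely be a better fit than this in-memory sketch.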

UCI Network Data Repository. Data + Design. Running your own study to collect data is not the only or best way to start your data analysis. Using someone else’s dataset and sharing your data is on the rise and has helped advance much of the recent research. Using external data offers several benefits: Where to Find External Data All those benefits sound great! So where do you find external data? To help narrow your search, ask yourself the following questions: Public Data Once you have a better idea of what you’re looking for in an external dataset, you can start your search at one of the many public data sources available to you, thanks to the open content and access movement that has been gaining traction on the Internet. If you decide to use a search engine (like Google) to look for datasets, keep in mind that you’ll only find things that are indexed by the search engine.

If you’re not sure what to do with a particular type of data, try browsing through the Information is Beautiful awards for inspiration. Non-Public Data. Top scoring links : datasets. Top scoring links : data. Google Custom Search.

To confirm

Academic Torrents. Umbrae/reddit-top-2.5-million. PhysioBank Archive Index. This page lists all currently available databases in the PhysioBank archives, organized according to the types of signals and annotations contained in each database: If you prefer, you can view separate lists of these databases organized by class: Class 1 (completed reference databases) Class 2 (archival copies of raw data that support published research, contributed by authors or journals) Class 3 (other contributed collections of data, including works in progress) We make class 2 and class 3 data available via PhysioNet as a service to the research community.

Contributed data are placed in classes 2 and 3 on acceptance, and may be admitted to class 1 after review and a public comment period. On this page, listings within each group are ordered by class, and then alphabetically by the name of the database. Multi-Parameter Databases These databases include a variety of digitized physiologic signals in each recording. [Class 1] MGH/MF Waveform Database. ECG Databases. Computer Vision Test Images. Alternative Interfaces. Materials for music research — Humanistinen tiedekunta. Machine Learning Repository: Data Sets.

Million Song Dataset | scaling MIR research. The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's); and to help new researchers get started in the MIR field. The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The Million Song Dataset is also a cluster of complementary datasets contributed by the community. The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA.

How to get started To get a sense of the dataset, you can look at this description of one of the million songs. To start your own experiments, you can download the entire dataset (280 GB). We also have a set of suggested tasks, including snippets of code to get you started.