(1) Data: Where can I get large datasets open to the public

Machine Learning Repository: Amazon Commerce reviews set Data Set
Source: Dataset creator and donator: ZhiLiu, e-mail: liuzhi8673 '@', institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
Data Set Information: The dataset is derived from customers' reviews on the Amazon Commerce website and is intended for authorship identification.
Attribute Information: The attributes capture each author's linguistic style, such as usage of digits and punctuation, word and sentence length, and usage frequency of words.
Relevant Papers: Sanya Liu, Zhi Liu, Jianwen Sun, Lin Liu, 'Application of Synergetic Neural Network in Online Writeprint Identification', JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 5, No. 3, pp. 126 ~ 135, 2011. Jianwen Sun, Zongkai Yang, Pei Wang, Sanya Liu, 'Variable Length Character N-Gram Approach for Online Writeprint Identification,' mines, pp. 486-490, 2010 International Conference on Multimedia Information Networking and Security, 2010.
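The kind of stylistic attributes described for the Amazon Commerce reviews set can be illustrated with a minimal sketch. It is not the dataset authors' feature extractor: the function name `writeprint_features`, the chosen features and the sample review are all assumptions made for illustration.

```python
import re
import string
from collections import Counter


def writeprint_features(text, top_k=3):
    """Compute a few simple stylistic features for one review (illustrative only)."""
    # Words are runs of letters/apostrophes; digits are counted separately.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Sentences are split on runs of ., ! or ? and empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)
    return {
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "top_words": Counter(words).most_common(top_k),
    }


# Hypothetical review text, not taken from the dataset.
sample = "Great phone. Battery lasts 2 days! Great value."
features = writeprint_features(sample)
print(features)
```

A real authorship-identification pipeline would compute many more such attributes per review and feed them to a classifier.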

Publicly Available Big Data Sets :: Hadoop Illuminated
Public Data sets on Amazon AWS: Amazon provides the following data sets: ENSEMBL Annotated Genome data, US Census data, UniGene, Freebase dump. Data transfer is 'free' within the Amazon ecosystem (within the same zone). AWS data sets
InfoChimps: InfoChimps has a data marketplace with a wide variety of data sets. InfoChimps market place
Comprehensive Knowledge Archive Network: open source data portal platform; data sets available on from
Stanford network data collection
Open Flights: Crowd sourced flight data
Flight arrival data

UCI Machine Learning Repository
Finding Data on the Internet
By RevoJoe on October 6, 2011. The following list of data sources has been modified as of 3/18/14. If an (R) appears after a source, the data are already in R format or there exist R commands for directly importing the data from R.
Economics: American Economic Ass.
Data Science Practice: This section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly 2014). Datasets on the book site: Enron Email Dataset; GetGlue (time stamped events: users rating TV shows); Titanic Survival Data Set; Half a million Hubway rides.
Other sections: Finance, Government, Health Care (including Gapminder), Machine Learning, Networks, Science.

Machine Learning - Course website
Chris Thornton
This course teaches the theory and practice of machine learning using a mixture of demos, lectures and labs. Assessment is based on one programming assignment and an unseen exam. Most of the syllabus material is in the online lecture notes, but note-taking and additional reading are strongly advised. The first meeting for the course will be the first lecture in week 1. Your first lab will be your first scheduled lab session after the lecture on k-means clustering.
Week 2: k-means clustering, agglomerative clustering, cluster hierarchies, centroids.
If you have questions about the material, the best thing is to put a question to me during a lecture. If you prefer, you can approach me at the end of a lecture. If that doesn't work, you can talk to me (or a lab tutor) in your next lab. Don't send me questions by email. There is no single course text.

IT Operations Analytics
In the fields of information technology and systems management, IT Operations Analytics (ITOA) is an approach or method applied to application software designed to retrieve, analyze and report data for IT operations. ITOA has been described as applying big data analytics to large datasets where IT operations can extract unique business insights.[1][2] In its Hype Cycle Report, Gartner rated the business impact of ITOA as being 'high', meaning that its use will see businesses enjoy significantly increased revenue or cost saving opportunities.[3] By 2017, Gartner predicts that 15% of enterprises will use IT operations analytics technologies to deliver intelligence for both business execution and IT operations.[2]
History: Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more Big Data practices, the ITOA industry has grown significantly since 2010.

SWF Charts > Buy
Free License: XML/SWF Charts is free to download and use. The free, unregistered version contains all the features, with two restrictions: clicking a chart takes the user to the XML/SWF Charts web site, and charts cannot be displayed inside another flash file. Developing and maintaining XML/SWF Charts takes a lot of effort. Web site developers may use unregistered copies of XML/SWF Charts in client web sites. Software developers may redistribute unregistered copies of XML/SWF Charts within other software products, with the copyright attached.
$29 - Single License: The single license is for one domain name, all its sub-domains and ports, and "localhost". Make a payment with PayPal, and get a registration code at the end of the payment process.
$399 - Bulk License

CVonline: Image Databases
Index by Topic. Another helpful site is the YACVID page.
Action Databases
Biological/Medical
Face Databases
Fingerprints
General Images
General RGBD and Depth Datasets
BigBIRD - 100 objects, each with 600 3D point clouds and 600 high-resolution color images spanning all views (Singh, Sha, Narayan, Achim, Abbeel)
CAESAR Civilian American and European Surface Anthropometry Resource Project - 4000 3D human body scans (SAE International)
CIN 2D+3D object classification dataset - segmented color and depth images of objects from 18 categories of common household and office objects (Björn Browatzki et al)
Cornell-RGBD-Dataset - Office Scenes (Hema Koppula)
IMPART multi-view/multi-modal 2D+3D film production dataset - LIDAR, video, 3D models, spherical camera, RGBD, stereo, action, facial expressions, etc.
Gesture Databases
Image, Video and Shape Database Retrieval
Object Databases
People, Pedestrian, Eye/Iris, Template Detection/Tracking Databases
3D KINECT Gender Walking data base (L.
Textures

Public Data Sets
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use. Last Modified: Mar 17, 2014 17:51 PM GMT
Three NASA NEX datasets are now available, including climate projections and satellite images of Earth. Last Modified: Nov 12, 2013 13:27 PM GMT
The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available. Last Modified: Oct 8, 2013 14:38 PM GMT
Human Microbiome Project Data Set. Last Modified: Sep 26, 2013 17:58 PM GMT
The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available. Last Modified: Jul 18, 2012 16:34 PM GMT

UCI Machine Learning Repository: Dermatology Data Set
Source: Original Owners: 1. Nilsel Ilter, M.D., Ph.D., Gazi University, School of Medicine 06510 Ankara, Turkey Phone: +90 (312) 214 1080 2. Donor: H.
Data Set Information: This database contains 34 attributes, 33 of which are linear valued and one of them is nominal. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The names and id numbers of the patients were recently removed from the database.
Attribute Information: Clinical Attributes (take values 0, 1, 2, 3, unless otherwise indicated): 1: erythema, 2: scaling, 3: definite borders, 4: itching, 5: koebner phenomenon, 6: polygonal papules, 7: follicular papules, 8: oral mucosal involvement, 9: knee and elbow involvement, 10: scalp involvement, 11: family history (0 or 1), 34: Age (linear)
Relevant Papers: G. Gisele L. Rafael S. Rafael S.

Data Visualisation: What's the big deal? | Career and Hiring Insights | Aquent
The concept of using pictures to understand complex information, especially data, has been around for a very long time, centuries in fact. One of the most cited examples of statistical graphics is Napoleon's invasion of Russia mapped by Charles Minard. The maps showed the size of the army and the path of Napoleon's retreat from Moscow. However, as with most things, it's technology that has truly allowed data visualisation to take the stage and get noticed. It's no surprise that with big data there's potential for BIG opportunity (someone pass me the shot glass), but many corporates are genuinely challenged when it comes to: understanding the data they have; finding value in it; and getting the wider business to buy in and just GET IT!!! So how do you tackle this? When I meet with clients and SMEs to talk data, the conversation naturally heads to stakeholder buy-in and the difficulties in gaining full business engagement, and I'm not surprised. By adding a little art to the science!

Datasets per Topic - TC-11
Description: This collection contains table structure ground truth data (rows, columns, cells etc.) for document images containing tables in the UNLV and UW3 datasets. The ground truth that we provide is stored in XML format, which stores row and column boundaries, bounding boxes of cells, and additional attributes such as row-spanning and column-spanning cells. The XML ground truth files have the same basename as the corresponding image in the respective dataset. These XML files can then be used to generate color encoded ground truth images in PNG format, which can be directly used by the pixel accurate benchmarking framework described in [1]. We used the T-Truth tool, also provided below, to prepare ground truth information.
Tables in UNLV dataset: The original dataset contains 2889 pages of scanned document images from a variety of sources (magazines, newspapers, business letters, annual reports etc.).
Tables in UW3 Dataset:
[1].
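To make the TC-11 table ground-truth description more concrete, here is a minimal sketch of reading cell bounding boxes and spans from an XML file. The element and attribute names (`Table`, `Cell`, `startRow`, `endRow`, `startCol`, `endCol`, `x0`, `y0`, `x1`, `y1`) are hypothetical and chosen only for illustration; the actual schema of the UNLV and UW3 ground truth files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical ground-truth layout; real element and attribute names may differ.
SAMPLE_XML = """
<Table x0="10" y0="20" x1="210" y1="120">
  <Cell startRow="0" endRow="0" startCol="0" endCol="1" x0="10" y0="20" x1="210" y1="70"/>
  <Cell startRow="1" endRow="1" startCol="0" endCol="0" x0="10" y0="70" x1="110" y1="120"/>
  <Cell startRow="1" endRow="1" startCol="1" endCol="1" x0="110" y0="70" x1="210" y1="120"/>
</Table>
"""


def load_cells(xml_text):
    """Return one dict per cell with its bounding box and row/column span."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.iter("Cell"):
        # All attributes in this assumed layout are integers.
        a = {key: int(value) for key, value in cell.attrib.items()}
        cells.append({
            "bbox": (a["x0"], a["y0"], a["x1"], a["y1"]),
            "row_span": a["endRow"] - a["startRow"] + 1,
            "col_span": a["endCol"] - a["startCol"] + 1,
        })
    return cells


cells = load_cells(SAMPLE_XML)
# Cells covering more than one row or column are the "spanning" cells.
spanning = [c for c in cells if c["row_span"] > 1 or c["col_span"] > 1]
print(len(cells), "cells,", len(spanning), "spanning")
```

Because each ground truth file shares its basename with the corresponding image, a loader like this could be paired with the image by swapping the file extension.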

How to access 100M time series in R in under 60 seconds
DataMarket, a portal that provides access to more than 14,000 data sets from various public and private sector organizations, has more than 100 million time series available for download and analysis. (Check out this presentation for more info about DataMarket.) And now with the new package rdatamarket, it's trivially easy to import those time series into R for charting, analysis, or anything. Here's what you need to do:
1. Register a DataMarket account (it's free)
2. Install the rdatamarket package in R with install.packages("rdatamarket")
3. Browse for a time series of interest (I found this series on unemployment)
4. Copy the URL of the page you're on (the short URL works too)
5. Use the dmseries function with the URL to extract the time series as a zoo object
With this package, you can go from finding interesting data on DataMarket to working with it in R in less than a minute.