
(1) Data: Where can I get large datasets open to the public?


Machine Learning Repository: Amazon Commerce reviews set Data Set. Source: dataset creator and donor: ZhiLiu, e-mail: liuzhi8673 '@' gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China. Data Set Information: the dataset is derived from customers' reviews on the Amazon Commerce website, for authorship identification. Attribute Information: the attributes capture each author's linguistic style, such as the usage of digits and punctuation, the length of words and sentences, and the usage frequency of words. Relevant Papers: Sanya Liu, Zhi Liu, Jianwen Sun, Lin Liu, 'Application of Synergetic Neural Network in Online Writeprint Identification', JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 5, No. 3, pp. 126-135, 2011; Jianwen Sun, Zongkai Yang, Pei Wang, Sanya Liu, 'Variable Length Character N-Gram Approach for Online Writeprint Identification', mines, pp. 486-490, 2010 International Conference on Multimedia Information Networking and Security, 2010.
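
The kinds of style attributes described above (digit and punctuation usage, word and sentence length, word frequency) can be illustrated with a small sketch. This is not the feature set used to build the dataset; the function name, the feature names and the tokenisation rules are assumptions made for illustration only.

```python
import re
import string
from collections import Counter


def stylometric_features(text, top_words=3):
    """Compute a few simple writing-style features for one review.

    Hypothetical helper: the features loosely mirror the categories named
    in the dataset description, not the dataset's actual attributes.
    """
    chars = len(text)
    # Words are runs of letters/apostrophes, lower-cased (an assumption).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Sentences are split on '.', '!' and '?' (a crude assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    digit_count = sum(ch.isdigit() for ch in text)
    punct_count = sum(ch in string.punctuation for ch in text)
    return {
        "digit_ratio": digit_count / chars if chars else 0.0,
        "punct_ratio": punct_count / chars if chars else 0.0,
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "top_words": Counter(words).most_common(top_words),
    }


if __name__ == "__main__":
    print(stylometric_features("Great phone. Great price, 2 thumbs up!"))
```

Features like these would normally be computed per review and then fed to a classifier that predicts the author; the ratios are normalised by text length so that long and short reviews stay comparable.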

Publicly Available Big Data Sets :: Hadoop Illuminated. Public data sets on Amazon AWS: Amazon provides the following data sets: ENSEMBL Annotated Genome data, US Census data, UniGene, Freebase dump. Data transfer is 'free' within the Amazon ecosystem (within the same zone). InfoChimps: InfoChimps has a data marketplace with a wide variety of data sets. Comprehensive Knowledge Archive Network: open source data portal platform; data sets available on datahub.io from ckan.org. Stanford network data collection. Open Flights: crowd-sourced flight data. Flight arrival data.

Finding Data on the Internet. By RevoJoe on October 6, 2011. The following list of data sources has been modified as of 3/18/14. If an (R) appears after a source, the data are already in R format or there exist R commands for importing the data directly into R. Economics: American Economic Ass. Data Science Practice: this section contains data sets used in the book "Doing Data Science" by Rachel Schutt and Cathy O'Neil (O'Reilly 2014). Datasets on the book site: Enron Email Dataset; GetGlue (time stamped events: users rating TV shows); Titanic Survival Data Set; Half a million Hubway rides. Further entries: Finance, Government, Health Care, Gapminder, Machine Learning, Networks, Science.

Machine Learning - Course website. Chris Thornton. This course teaches the theory and practice of machine learning using a mixture of demos, lectures and labs. Assessment is based on one programming assignment and an unseen exam. Most of the syllabus material is in the online lecture notes, but note-taking and additional reading are strongly advised. The first meeting for the course will be the first lecture in week 1. Your first lab will be your first scheduled lab session after the lecture on k-means clustering. The syllabus runs over ten weeks; the Week 2 entry covers k-means clustering, agglomerative clustering, cluster hierarchies and centroids (pdf). If you have questions about the material, the best thing is to put a question to me during a lecture. If you prefer, you can approach me at the end of a lecture. If that doesn't work, you can talk to me (or a lab tutor) in your next lab. Don't send me questions by email. There is no single course text.

IT Operations Analytics. In the fields of information technology and systems management, IT Operations Analytics (ITOA) is an approach or method applied to application software designed to retrieve, analyze and report data for IT operations. ITOA has been described as applying big data analytics to large datasets where IT operations can extract unique business insights.[1][2] In its Hype Cycle Report, Gartner rated the business impact of ITOA as being 'high', meaning that its use will see businesses enjoy significantly increased revenue or cost saving opportunities.[3] Gartner predicts that by 2017, 15% of enterprises will use IT operations analytics technologies to deliver intelligence for both business execution and IT operations.[2] Due to the mainstream embrace of cloud computing and the increasing desire of businesses to adopt more Big Data practices, the ITOA industry has grown significantly since 2010.

Public Data Sets. A corpus of web crawl data composed of over 5 billion web pages; this data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use (last modified Mar 17, 2014). Three NASA NEX datasets are now available, including climate projections and satellite images of Earth (last modified Nov 12, 2013). The Ensembl project produces genome databases for human as well as over 50 other species, and makes this information freely available (last modified Oct 8, 2013). Human Microbiome Project Data Set (last modified Sep 26, 2013). The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available (last modified Jul 18, 2012).

UCI Machine Learning Repository: Dermatology Data Set. Source: Original Owners: 1. Nilsel Ilter, M.D., Ph.D., Gazi University, School of Medicine, 06510 Ankara, Turkey, Phone: +90 (312) 214 1080. 2. Donor: H. Data Set Information: This database contains 34 attributes, 33 of which are linear valued and one of which is nominal. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The names and id numbers of the patients were recently removed from the database. Attribute Information: Clinical Attributes (take values 0, 1, 2, 3, unless otherwise indicated): 1: erythema; 2: scaling; 3: definite borders; 4: itching; 5: koebner phenomenon; 6: polygonal papules; 7: follicular papules; 8: oral mucosal involvement; 9: knee and elbow involvement; 10: scalp involvement; 11: family history (0 or 1); 34: Age (linear).
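
As a rough illustration of the attribute encoding described above, the sketch below checks one record against the stated value ranges. It covers only the attributes the excerpt actually lists (1 to 11 and 34); the function name, the assumption that a record arrives as a list of 34 integers, and the non-negative age check are all hypothetical and not part of the dataset documentation.

```python
# Clinical attributes named in the excerpt, keyed by their 1-based index.
CLINICAL_NAMES = {
    1: "erythema", 2: "scaling", 3: "definite borders", 4: "itching",
    5: "koebner phenomenon", 6: "polygonal papules", 7: "follicular papules",
    8: "oral mucosal involvement", 9: "knee and elbow involvement",
    10: "scalp involvement",
}


def check_record(values):
    """Return a list of problems found in one 34-attribute record.

    Hypothetical helper: attributes 12-33 are not described in the excerpt,
    so they are deliberately left unchecked.
    """
    if len(values) != 34:
        return [f"expected 34 attributes, got {len(values)}"]
    problems = []
    # Attributes 1-10 take the values 0, 1, 2 or 3.
    for idx, name in CLINICAL_NAMES.items():
        v = values[idx - 1]
        if v not in (0, 1, 2, 3):
            problems.append(f"attribute {idx} ({name}) out of range: {v}")
    # Attribute 11 (family history) is binary.
    if values[10] not in (0, 1):
        problems.append(f"attribute 11 (family history) must be 0 or 1: {values[10]}")
    # Attribute 34 is the age; rejecting negative ages is an assumption.
    if values[33] < 0:
        problems.append(f"attribute 34 (age) is negative: {values[33]}")
    return problems


if __name__ == "__main__":
    record = [0] * 34
    record[33] = 45
    print(check_record(record))
```

A check like this is mainly useful as a sanity pass before modelling, since a value outside the documented range usually signals a parsing or column-alignment error.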

Data Visualisation: What's the big deal? | Career and Hiring Insights | Aquent. The concept of using pictures to understand complex information, especially data, has been around for a very long time, centuries in fact. One of the most cited examples of statistical graphics is Napoleon's invasion of Russia mapped by Charles Minard. The maps showed the size of the army and the path of Napoleon's retreat from Moscow. However, as with most things, it's technology that has truly allowed data visualisation to take the stage and get noticed. It's no surprise that with big data there's potential for BIG opportunity (someone pass me the shot glass), but many corporates are genuinely challenged when it comes to understanding the data they have, finding value in it, and getting the wider business to buy in and just GET IT!!! When I meet with clients and SMEs to talk data, the conversation naturally heads to stakeholder buy-in and the difficulties in gaining full business engagement, and I'm not surprised. So how do you tackle this? By adding a little art to the science!

How to access 100M time series in R in under 60 seconds. DataMarket, a portal that provides access to more than 14,000 data sets from various public and private sector organizations, has more than 100 million time series available for download and analysis. (Check out this presentation for more info about DataMarket.) And now with the new package rdatamarket, it's trivially easy to import those time series into R for charting, analysis, or anything. Here's what you need to do: (1) register an account on DataMarket.com (it's free); (2) install the rdatamarket package in R with install.packages("rdatamarket"); (3) browse DataMarket.com for a time series of interest (I found this series on unemployment); (4) copy the URL of the page you're on (the short URL works too); (5) call the dmseries function with the URL to extract the time series as a zoo object. With this package, you can go from finding interesting data on DataMarket to working with it in R in less than a minute.

50 external machine learning / data science resources and articles. Data Science Central, by Vincent Granville, Sep 24, 2015. Starred articles are candidates for the picture of the week. A comprehensive list of all past resources is found here.

The 70 Online Databases that Define Our Planet. Back in April, we looked at an ambitious European plan to simulate the entire planet. The idea is to exploit the huge amounts of data generated by financial markets, health records, social media and climate monitoring to model the planet's climate, societies and economy. The vision is that a system like this can help to understand and predict crises before they occur so that governments can take appropriate measures in advance. There are numerous challenges here. Nobody yet has the computing power necessary for such a task, nor are there models that can accurately represent even much smaller systems. But before any of that is possible, researchers must gather the economic, social and technological data needed to feed this machine. Today, we get a grand tour of this challenge from Dirk Helbing and Stefano Balietti at the Swiss Federal Institute of Technology in Zurich. These and other pursuits are now producing massive amounts of data, much of which is freely available on the web.

Analytics: Turning a Flood of Data into Valuable Information. The benefits that come from data analytics are many: it has helped reduce inmate populations, improve the reliability of emergency medical services and reduce traffic fatalities, to name just a few. Though some government agencies are slow to embrace it due to limited capital or sheer intimidation in the face of disparate systems and fragmented technologies, others have taken hold of the proverbial horns and started the process of improving their daily operations by way of the data. And during the California Technology Forum held Aug. 11 in Sacramento, state and local officials delved into the insights gained from the exponential increase of data, and where teams need to focus their energy to turn this flood of data into valuable information. “From that, I understood that big data wasn’t just the amount of data we were talking about," he said. Schmidt said the effort to put data in the hands of internal and external application developers is a major focus at the state’s innovation lab.

Open Data. Figuring Out How IT, Analytics, and Operations Should Work Together. A new set of relationships is being formed within companies around how people working in data, analytics, IT, and operations teams work together. Is there a “right” way to structure these relationships? Data and analytics represent a blurring of the traditional lines of demarcation between the scope of IT and the responsibilities of operating divisions. Enter data and analytics, which provide an opportunity for such innovation. Let’s look at four examples of how different corporations responded when faced with this question. The integrated operational data and analytics function: in this case the data and analytics function created its own proprietary consumer database, stitched together from many sources, and developed a proprietary cloud-based environment that allows it to engage consumers across all their social media platforms. The stand-alone data and analytics service function: the size of, and investment in, this data and analytics function is considerable.
