
Data Science


H2O.ai. Why H2O?

H2O.ai

H2O makes it possible for anyone to easily apply machine learning and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by open source technology. H2O was written from scratch in Java and seamlessly integrates with the most popular open source products like Apache Hadoop® and Spark™ to give customers the flexibility to solve their most challenging data problems.

Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive web-based Flow graphical user interface or familiar programming environments like R, Python, Java, Scala, JSON, and through our powerful APIs.

Glmnet_alpha. Chorus/chorus. MADlib. BigMUD 2013: The First International Workshop on Mining and Understanding from Big Data. In conjunction with IEEE International Conference on Data Mining (ICDM 2013) Dallas, Texas.

BigMUD 2013: The First International Workshop on Mining and Understanding from Big Data

UCI Math77B: Collaborative Filtering - Kaggle in Class. Kiji Community - Build Real-Time Scalable Data Applications on Apache HBase. Welcome to Apache Avro! Declarative Languages And Systems. Twister: Iterative MapReduce. Giraph - Welcome To Apache Giraph! HaLoop Talk. Haloop - An modified version of Hadoop to support efficient iterative data processing on large commodity clusters. Why do we develop the HaLoop project?

haloop - An modified version of Hadoop to support efficient iterative data processing on large commodity clusters

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. However, these new platforms do not have built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph processing, model fitting, and so on.
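To make the point about iterative programs concrete, here is a minimal sketch in plain Python, not Hadoop or HaLoop code. It assumes a toy graph and uses label propagation to find connected components; the function names (map_phase, reduce_phase, connected_components) are hypothetical. A single map/reduce pass only moves labels one hop, so the driver has to rerun the same job and check for convergence itself.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (node, candidate_label) pairs: each node's own label plus its neighbours' labels."""
    for node, label in labels.items():
        yield node, label
    for a, b in edges:
        yield b, labels[a]
        yield a, labels[b]


def reduce_phase(pairs):
    """Group the emitted pairs by node and keep the smallest label seen for each node."""
    grouped = defaultdict(list)
    for node, label in pairs:
        grouped[node].append(label)
    return {node: min(candidates) for node, candidates in grouped.items()}


def connected_components(nodes, edges):
    """Driver loop: rerun the same map/reduce job until its output stops changing."""
    labels = {node: node for node in nodes}
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        rounds += 1
        if new_labels == labels:  # fixpoint reached, so stop iterating
            return labels, rounds
        labels = new_labels


# Toy graph (illustrative data): one chain 1-2-3-4 and one separate pair 5-6.
labels, rounds = connected_components(
    nodes=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (3, 4), (5, 6)],
)
print(labels, rounds)
```

The loop and the convergence test live outside the map and reduce functions, which is the part a DAG-style platform does not provide on its own.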

What is HaLoop? Simply speaking, HaLoop = Ha, Loop:-) HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve these applications. Big Data Discovery. Experience Data-Driven Applications. Refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks). User Guide · OpenRefine/OpenRefine Wiki. How to use OpenRefine: If you haven't done so already, we strongly suggest you watch the screencasts first as they will give you an idea of how to use OpenRefine.

User Guide · OpenRefine/OpenRefine Wiki

The Basics First, although OpenRefine might start out looking like a spreadsheet program (Microsoft Excel, Google Spreadsheets, etc.), don't expect it to work like a spreadsheet program. That's almost like expecting a database to work like a text editor. OpenRefine is NOT for entering new data one cell at a time. OpenRefine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume. To use OpenRefine, think in big patterns. Www.stat.ufl.edu/~casella/Talks/BayesRefresher.pdf. The curse of big data. This seminal article highlights the dangers of reckless applications and scaling of data science techniques that have worked well for small, medium-size and large data.

The curse of big data

We illustrate the problem with flaws in big data trading, and propose solutions. Also, we believe expert data scientists are more abundant (but very different) than what hiring companies claim: read our "related articles" section at the bottom for more details. This article is written in simple English, is very short and contains both high level material for decision makers, as well as deep technical explanations when needed. In short, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power - even worse, the strongest patterns might be ... The question is: how do you discriminate between a real and accidental signal in vast amounts of data?
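A minimal sketch of this effect, assuming purely random data (the sizes, the seed, and the function names are illustrative and not taken from the article): every metric is unrelated noise, yet the strongest correlation with a random target grows as more metrics are searched.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_metrics, n_points, seed=0):
    """Largest |correlation| between one random target and n_metrics random metrics."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_metrics):
        metric = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(metric, target)))
    return best


# Same seed in both calls, so the 10 metrics are a prefix of the 5000.
few = strongest_chance_correlation(n_metrics=10, n_points=20)
many = strongest_chance_correlation(n_metrics=5000, n_points=20)
print(f"10 metrics: {few:.2f}, 5000 metrics: {many:.2f}")
```

None of these metrics has any predictive power by construction, so a high best score here is a coincidence produced by the size of the search.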

Visualization Gallery - CS294-10 Visualization Sp11. What is Hbase. Apache Hadoop is an excellent framework for processing, storing and analyzing large volumes of unstructured data - aka Big Data.

What is Hbase

But getting a handle on all the project’s myriad components and sub-components, with names like Pig and Mahout, can be difficult. Cloudera Co-Founder and CTO Amr Awadallah walked the Wikibon Community through the Hadoop ecosystem during a visit to #theCUBE at Hadoop World 2011 in New York City. Below is a glossary describing the key Hadoop components and sub-components, as defined both by Awadallah and Wikibon, as well as the live recording of Awadallah inside #theCUBE from the show floor.

BigQuery. Home - The Big Data Landscape. HBase vs Cassandra. Www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf. Thanks for Trying R: Free Data Resources. Visualization of the Week: On tour with the Rolling Stones for 50 years By Jenn Webb December 5, 2012 The Rolling Stones have reached their 50th anniversary milestone.

Thanks for Trying R: Free Data Resources

The band celebrated by kicking off an anniversary tour, and the team over at CartoDB took the opportunity to test their new CartoDB Javascript library and visualize The Rolling Stones’ complete … A change is gonna come By Ann Spencer December 4, 2012 When I told some of my friends and family that I was joining O’Reilly Media as an editor focusing on ORM’s Strata practice area, their responses reflected the diversity of my loved ones.

Home. Data.gov.uk. Processing.org. Getting In Shape For The Sport Of Data Science. Getting Started With Python For Data Science. Who is this for and what will I learn?

Getting Started With Python For Data Science

This tutorial assumes some knowledge of Python and programming, but no knowledge whatsoever of data science, machine learning, or predictive modeling (or, heck, even statistics). To the extent there is a target audience, it's probably hacker types who learn best by doing. All the code from this tutorial is available on github. You might encounter terms you're not familiar with, but that shouldn't stop you from completing the tutorial. By the end, you won't know a heck of a lot more about data science per se, but you'll have a nice environment set up where you can easily play with lots of different data science tools and even make credible entries to Kaggle competitions.

Tutorial: scikit-learn - Machine Learning in Python with Contributor Jake VanderPlas. Data analysis in Python with pandas. 2012 PyData Workshop: Data Analysis in Python with Pandas. Machine Learning/Decision Trees/C4.5 Tutorial. References : P.

Machine Learning/Decision Trees/C4.5 Tutorial

Winston, 1992. C4.5 is a software extension of the basic ID3 algorithm designed by Quinlan to address the following issues not dealt with by ID3: avoiding overfitting the data; determining how deeply to grow a decision tree; reduced error pruning. Denoising Autoencoders (dA) — DeepLearning 0.1 documentation. The Denoising Autoencoder (dA) is an extension of a classical autoencoder and it was introduced as a building block for deep networks in [Vincent08].

Denoising Autoencoders (dA) — DeepLearning 0.1 documentation

We will start the tutorial with a short discussion on Autoencoders. Autoencoders: See section 4.6 of [Bengio09] for an overview of auto-encoders. CS346, Data Mining, Rule Learning via Sequential Covering. CS346, Data Mining, Prof. Alvarez. Learning Rules by Sequential Covering. Rules provide models of data that people find intuitive. Therefore, data mining techniques that produce rules can be of interest when the results will be used and interpreted by people. One option is to extract rules from decision trees: each branch in the tree corresponds to a rule that has its leaf label as the consequent and the conjunction of the attribute-value pairs along the path as the antecedent.
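The tree-to-rules option can be sketched in a few lines of Python. This is an illustration only, not the course's code and not sequential covering itself; the nested-dict tree layout, the toy weather tree, and the function names are assumptions made for the example.

```python
def extract_rules(tree, path=()):
    """Return one (antecedent, consequent) rule per root-to-leaf branch.

    Assumed layout: a leaf is a plain label; an internal node is a dict
    {"attribute": name, "branches": {value: subtree}}.
    """
    if not isinstance(tree, dict):
        # Leaf: the path walked so far is the antecedent, the label is the consequent.
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules.extend(extract_rules(subtree, path + ((tree["attribute"], value),)))
    return rules


def format_rule(antecedent, consequent):
    """Render a rule as IF <conjunction of attribute-value pairs> THEN <label>."""
    conditions = " AND ".join(f"{a} = {v}" for a, v in antecedent) or "TRUE"
    return f"IF {conditions} THEN {consequent}"


# Toy tree (illustrative data, not from the course notes).
toy_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "wind", "branches": {"strong": "no", "weak": "yes"}},
    },
}

rules = extract_rules(toy_tree)
for antecedent, consequent in rules:
    print(format_rule(antecedent, consequent))
```

Each printed line corresponds to exactly one branch of the tree, so the rule set covers the same cases the tree does.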

Sequential Covering Pseudocode. Elastic Scale-Out. Machine Learning and Data Mining: 04 Association Rule Mining. Database Research Group. Seminar - DAIS: Data and Information Systems Research Laboratory - University of Illinois - Engineering Wiki. The Yahoo!-DAIS Seminar will be held on Wednesday at 11AM in 3405 SC. As in other semesters, we will have a few visiting speakers who must be scheduled at a different day or time, due to their travel schedules. Students who take the DAIS Seminar for credit can miss up to two regular seminars. Speakers are announced on the DAIS mailing list (as are other items of interest to the DAIS community).

It is quick and easy to subscribe to the DAIS mailing list. IlliMine. Home. Welcome to the Kaggle Public Wiki, a new support resource for competitive data science and machine learning. Be bold with your changes, and together let's build an amazing, approachable library of knowledge. The Kaggle Public Wiki is a resource for learning statistics, machine learning and other data science concepts, with a strong focus on the practical application of those skills in a competitive environment. It will help participants in Kaggle competitions understand the structure of data science competitions and the metrics they are scored on, teach tactics and strategy for attacking different types of problems, and track the development of the newest cutting edge statistical and machine learning techniques. Kaggle In Class. Wiki Syntax: The format of the wiki syntax we use, which is based on Markdown. Wiki Design Principles: Why the software for this wiki works the way it does, and our general development roadmap.

Hadoop Tutorial. Introduction HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.

Eigenfactor. The Joy of Stats. About the video: Hans Rosling says there’s nothing boring about stats, and then goes on to prove it. A one-hour documentary produced by Wingspan Productions and broadcast by the BBC, 2010. A DVD is available to order from Wingspan Productions.

Readings

Neural Network for Recognition of Handwritten Digits. Download the Neural Network demo project - 203 Kb (includes a release-build executable that you can run without the need to compile). Download a sample neuron weight file - 2,785 Kb (achieves the 99.26% accuracy mentioned above). Download the MNIST database - 11,594 Kb total for all four files (external link to four files which are all required for this project).

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. Yann LeCun's Home Page. Guidelines for Data De-Identification or Anonymization - Information Security Guide - Internet2 Wiki. Version 1.1: July 2010 NOTE: For the purposes of this document, although there are subtle differences in their definitions, "de-identification" and "anonymization" will be considered synonymous terms. These terms refer to situations where personally identifying information is removed from data sets in order to protect a person's individual privacy.

Hadoop-as-a-Service + open source framework. One Part Open Framework Your Tools, on Hadoop. Interactive Hadoop. Static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012. Global Leader in Data Warehousing, Big Data Analytic Technologies & Data Driven Marketing - Teradata. Introduction to JSMapreduce. JSMapreduce. Getting Started With Orange - sdakak. Introduction. Orange. Jeff Hawkins on Artificial Intelligence - Part 1/5. Machine Learning and Language - Structure:Data 2013. Hadoop-as-a-Service + open source framework. Hadoop Connect.

HTM

Jeff Hawkins - Hierarchical Temporal Memory. Www-edlab.cs.umass.edu/cs691jj/hawkins-and-george-2006.pdf. Docs.mohammadzadeh.info/Projects/PR/HandwrittenDigitRecognition/References/9-2007-Handwritten Digit Recognition using Hierarchical Temporal Memory.pdf. Www.wseas.us/e-library/conferences/2010/Faro/DNCOCO/DNCOCO-08.pdf. Lesson 7.2 Bayesian Network Classifiers. GeNIe and SMILE - Downloads. We can compile SMILE for different platforms/specifications on request. Please email us for details. Distributions for Visual C++ contain three sets of libraries: two optimized builds for static and dynamic CRT, and another unoptimized build linking against debug DLL runtime without NDEBUG defined. smile.h and smilearn.h contain autolink #pragma directives and will select correct library, depending on CRT configuration of your project.

Tutorial on Bayesian Networks with Netica. 1. What is a Bayes net? A Bayes net is a model. It reflects the states of some part of a world that is being modeled and it describes how those states are related by probabilities. OpenMarkov. SamIam - Sensitivity Analysis, Modeling, Inference and More. Command Line Shell For SQLite. Small. Fast. JSMapreduce. Www-stat.stanford.edu/~jhf/ftp/prim.pdf.

Secs.ceas.uc.edu/~mazlack/dbm.sp2008/Silverstein_Craig.pdf. Documentation. Pulse of the Nation: U.S. Mood Throughout the Day inferred from Twitter. Moods on Twitter Follow Biological Rhythms, Study Finds. MAD Skills: New Analysis Practices for Big Data. GeoNames. YAGO - D5: Databases and Information Systems (Max-Planck-Institut für Informatik)

About WordNet - WordNet - About WordNet. OpenCyc. S4: Distributed Stream Computing Platform. Giraph - Welcome To Apache Giraph. Hyracks - Hyracks is a data parallel platform to run data-intensive jobs on a cluster of shared-nothing machines. Twister: Iterative MapReduce. Haloop - An modified version of Hadoop to support efficient iterative data processing on large commodity clusters.

MADlib. MongoDB. HBase - Apache HBase™ Home. Introduction to MapReduce: When to consider MapReduce for your data, Running MapReduce on your local workstation in less than 10 minutes, Understanding the distributed and parallel nature of the MapReduce architecture.