The One-Stop Shop for Big Data
Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper. Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining. What are we waiting for? Let’s get started! Here are the algorithms:
5 lessons we learned about data science in 2013 - VentureBeat
Most people know what marketing executives do every day. They try to catch people’s attention through email, ads, tweets, and press releases. As for data scientists, well, their work is not nearly as well understood.
A Practical Intro to Data Science — Zipfian Academy - Data Science Bootcamp
There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant.

MapReduce and Spark
About a week ago, I posted an article on Cloudera’s strategy on SQL in the Apache Hadoop ecosystem. In the article, I argued that a special-purpose distributed query processing engine will perform better than one that translates work into a general-purpose MapReduce framework, even if MapReduce is improved to trim latency and improve throughput. Notwithstanding that bet, we simultaneously believe that the ecosystem needs a high-performance alternative to the current MapReduce implementation.
Random forest
The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.

History
The early development of random forests was influenced by the work of Amit and Geman, who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree.
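The random subspace idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm from any particular paper or library: each "tree" is reduced to a one-level decision stump to keep the example short, and the names fit_stump, predict_stump, fit_random_subspace and predict are hypothetical. Each stump sees only a randomly chosen subset of the feature columns, and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Fit a one-level tree (a stump) that may only split on `features`.

    Returns (feature_index, threshold, left_label, right_label) for the
    split with the fewest training errors.
    """
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # `left` is never empty because t is one of the observed values.
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_random_subspace(rows, labels, n_trees=25, n_sub=2, seed=0):
    """Grow `n_trees` stumps, each restricted to `n_sub` random feature columns.

    Restricting a stump to a feature subset plays the role of projecting
    the training data into a random subspace before fitting each tree.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    return [
        fit_stump(rows, labels, rng.sample(range(n_features), n_sub))
        for _ in range(n_trees)
    ]


def predict(ensemble, row):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]
```

With rows as a list of equal-length numeric feature lists and labels as the matching class labels, forest = fit_random_subspace(rows, labels) builds the ensemble and predict(forest, new_row) classifies a new point. The sketch picks one feature subset per tree, as in the random subspace method; drawing a fresh subset at every split of a deeper tree would be a different design.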
The Analytics Maturity Spectrum
There is no doubt “Big Data” has taken the tech world by storm. I have spent much of 2013 talking about analytics and data science with people all around the US, going to conferences like Strata, and immersing myself in this world for the last 12 months. Over the course of this journey, I have started to notice some patterns in how people in different kinds of organizations understand and invest in analytics. The analytics-led company is a concept I will define here as a company that seeks to use analytics (predictive, prescriptive, or descriptive) as one of its chief competitive weapons. The canonical example is Amazon, whose use of analytics is part of the DNA of the company.
Heat map
Figure caption: Heat map generated from DNA microarray data reflecting gene expression values in several conditions.
Heat maps originated in 2D displays of the values in a data matrix. Larger values were represented by small dark gray or black squares (pixels) and smaller values by lighter squares.

Scala as a platform for statistical computing and data science
A feature wish list
It should: The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

Starting Your Big Data Lab for a POC
In continuation of my previous blog post, “6 Steps to Start Your Big Data Journey,” I want to address here the question “How should you start your big data journey?”
What is the Big Data Lab?
The Big Data Lab is a dedicated development environment, within your current technology infrastructure, that can be created explicitly for experimentation with emerging technologies and approaches to big data and analytics.