Career Advice: How do I become a data scientist

Enterprise Software Doesn't Have to Suck: Data Analysis Training. I'm training some of my colleagues on Big'ish data analysis this week. Here's how I'm running the class. Would love your ideas to make it better. After completion of the course, you will be able to: understand concepts of data science, related processes, tools, techniques and the path to building expertise; use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip); use Excel to do basic analysis and plots; write and understand R code (data structures, functions, packages, etc.); explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to the dataset); plot charts on a dataset using R. Prerequisites: good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.); familiarity with Unix OS. CLASS TOPICS: A) Intro to data science: explain data science and its importance. B) Steps in data science. C) Skills needed for data science. D) Learning R: we will pick a tool to learn the concepts of data science. Tutorials. Books: TBD.
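The basic statistics named in the course description (min, max, avg, sd, variance, deciles) can be sketched in a few lines. The class itself uses R and Excel; the Python below is only an illustration, and the helper name `summarize` is hypothetical rather than part of the course material.

```python
import statistics

def summarize(values):
    """Return the basic summary statistics listed in the course description."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),           # sample standard deviation
        "variance": statistics.variance(values),  # sample variance (n - 1 denominator)
        "deciles": statistics.quantiles(values, n=10),  # 9 cut points
    }

if __name__ == "__main__":
    for name, value in summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).items():
        print(name, value)
```

`statistics.quantiles` requires Python 3.8 or later.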

1.2 Sample ACF and Properties of AR(1) Model | STAT 510 - Applied Time Series Analysis. This lesson defines the sample autocorrelation function (ACF) in general and derives the pattern of the ACF for an AR(1) model. Recall from Lesson 1.1 for this week that an AR(1) model is a linear model that predicts the present value of a time series using the immediately prior value in time. Definition: Let x_t denote the value of a time series at time t. The autocorrelation between x_t and x_{t-h} is

\[
\frac{\text{Covariance}(x_t, x_{t-h})}{\text{Std.Dev.}(x_t)\,\text{Std.Dev.}(x_{t-h})} = \frac{\text{Covariance}(x_t, x_{t-h})}{\text{Variance}(x_t)}
\]

The denominator in the second formula occurs because the standard deviation of the series is the same at all times. Stationary Series: As a preliminary, we define an important concept, that of a stationary series. Definition: A series x_t is said to be (weakly) stationary if it satisfies the following properties: the mean E(x_t) is the same for all t; the variance of x_t is the same for all t; the covariance (and also correlation) between x_t and x_{t-h} is the same for all t. Many stationary series have recognizable patterns for their ACF and PACF. Assumptions: Lag. Mean:
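To make the ACF definition concrete, here is a minimal sketch that simulates an AR(1) series and computes its sample ACF. It assumes the zero-mean form x_t = phi * x_{t-1} + w_t with standard normal noise w_t; the function names `simulate_ar1` and `sample_acf` and the parameter `phi` are illustrative choices, not taken from the lesson. For a stationary AR(1) of this form, the standard theoretical result is that the lag-h autocorrelation equals phi^h, so the sample values should decay roughly geometrically.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate x_t = phi * x_{t-1} + w_t with w_t ~ N(0, 1) (assumed form)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    # The same denominator is used at every lag, mirroring the
    # "same variance at all times" assumption of a stationary series.
    denom = sum((v - mean) ** 2 for v in x)
    acf = []
    for h in range(max_lag + 1):
        num = sum((x[t] - mean) * (x[t - h] - mean) for t in range(h, n))
        acf.append(num / denom)
    return acf

if __name__ == "__main__":
    phi = 0.6
    series = simulate_ar1(phi, 5000, seed=1)
    for h, r in enumerate(sample_acf(series, 5)):
        print(f"lag {h}: sample {r:.3f}, theoretical {phi ** h:.3f}")
```

With a few thousand observations the sample values should land close to phi^h, though they will vary with the seed.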

Career of the Future: Data Scientist [INFOGRAPHIC] Want a job where the talent is scarce, and likely to remain that way for at least the next five years? Become a data scientist. That, at least, is the conclusion of a global survey of the number-crunching professionals by IT service company EMC. Some 63% of data scientists say the profession is going to be undermanned for the foreseeable future, and half of those see it as a serious shortage. But not all of them will have the capacity to turn that raw data into anything useful. "Data is the new oil," says Andreas Weigend, Head of the Social Data Lab at Stanford and the former Chief Scientist at Amazon, in a statement. Check out the rest of the survey data in the detailed infographic below, and let us know in the comments if this is a career you'd like to pursue.

BI at large scale. As more and more data is collected from pretty much everything a user does, such as transaction activities, social interactions and information searches, enterprises have been actively looking into ways to turn this vast amount of raw data into useful information. BI process flow: it includes the following stages of processing. On the other hand, massively parallel processing platforms such as Hadoop Map/Reduce have, over the last few years, proven capable of processing terabyte or even petabyte ranges of data. Although many sequential algorithms can be restructured to run in Map/Reduce, including a large portion of machine learning algorithms, there isn't a corresponding parallel implementation of ML available in massively parallel form. Approach 1: Apache Mahout. One approach is to "re-implement" the ML algorithm in Map/Reduce, and this is the path of the Apache Mahout project. I also found that this approach can smoothly fade out outdated models.
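As a small illustration of what "restructuring a sequential algorithm to run in Map/Reduce" means, the sketch below computes a mean by mapping each data partition to a partial (sum, count) pair and then reducing the pairs. It runs in a single process with plain Python; it is not Hadoop or Mahout code, and the function names are hypothetical.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as a (sum, count) pair."""
    return (sum(values), len(values))

def combine(a, b):
    """Reduce step: merge two partial summaries. The operation is associative,
    so partial results can be merged in any grouping."""
    return (a[0] + b[0], a[1] + b[1])

def mapreduce_mean(partitions):
    """Mean over all partitions, computed from per-partition summaries only."""
    total, count = reduce(combine, map(map_partition, partitions), (0.0, 0))
    return total / count

if __name__ == "__main__":
    print(mapreduce_mean([[1, 2, 3], [4, 5], [6]]))
```

The key design point is that the reduce step only ever sees small summaries, never the raw rows, which is what allows the map step to run independently on each partition.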

ONLINE OPEN-ACCESS TEXTBOOKS. Forecasting: principles and practice (Rob J Hyndman, George Athanasopoulos); Statistical foundations of machine learning (Gianluca Bontempi, Souhaib Ben Taieb); Electric load forecasting: fundamentals and best practices (Tao Hong, David A.); Modal logic of strict necessity and possibility (Evgeni Latinov); Applied biostatistical analysis using R (Stephen B.); Introduction to Computing: Explorations in Language, Logic, and Machines (David Evans).

R, the Software, Finds Fans in Data Analysts. Photo caption (left, Stuart Isett for The New York Times; right, Kieran Scott for The New York Times): R first appeared in 1996, when the statistics professors Robert Gentleman, left, and Ross Ihaka released the code as a free software package. R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca partly because data mining has entered a golden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use it. But R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use. “R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. It is also free.

Syllabus: stats202 1.0 documentation. Announcements. Dec 11, 2013: The final exam grades are available on Coursework. The average was 79 with a standard deviation of 13. You can pick up your graded papers from my office on Friday from noon until 2pm, or by appointment next week. SCPD students will be sent their exams back tomorrow. The solutions to the final exam are now available here. Dec 7, 2013: Grade statistics are now available here. Dec 3, 2013: The Kaggle deadline has changed to Friday, December 6 on the website. Nov 4, 2013: Your gradebook is now available in our Coursework site. Oct 31, 2013: You may download the solutions to the midterm. Oct 13, 2013: Please send all regrade requests to the graders at stats202-aut1314-graders@lists.stanford.edu. Include: your full name and SUNet ID; the homework and problem number; the number of points that you lost; a brief justification of why you think the grading is incorrect or unfair. Oct 7, 2013: Both exams will be closed-book and closed-notes. Sep 30, 2013: Course description. Staff and office hours.

Data Mining Research - www.dataminingblog.com: big data. Over the last couple of years, we have seen an emerging data storage mechanism for storing large-scale data. These storage solutions differ quite significantly from the RDBMS model and are also known as NoSQL. Some of the key players include: Google BigTable, HBase, Hypertable; Amazon Dynamo, Voldemort, Cassandra, Riak; Redis; CouchDB, MongoDB. These solutions have a number of characteristics in common: they are key-value stores; they run on a large number of commodity machines; data is partitioned and replicated among these machines; and the data consistency requirement is relaxed (because the CAP theorem proves that you cannot get Consistency, Availability and Partition tolerance all at the same time). The aim of this blog is to extract the underlying technologies that these solutions have in common, and to get a deeper understanding of the implications for your application's design. I do not intend to compare the features of these solutions, nor to suggest which one to use. API model The basic form of API access is Data replication
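The shared characteristics above (a key-value store whose data is partitioned and replicated across machines) can be sketched in memory. The example assumes a hash ring, which is one common partitioning scheme and is not stated in the excerpt; the class `TinyRing`, its methods, and the node names are all hypothetical, and real systems add failure handling, versioning and consistency protocols that are omitted here.

```python
import bisect
import hashlib

class TinyRing:
    """In-memory sketch of a partitioned, replicated key-value store."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Place each node on a ring at the position given by its hash.
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._points = [h for h, _ in self._ring]
        # One dict per node stands in for that machine's local storage.
        self.stores = {n: {} for n in nodes}

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def owners(self, key):
        """Nodes responsible for key: the next `replicas` nodes clockwise."""
        start = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def put(self, key, value):
        # Write the value to every replica that owns the key.
        for node in self.owners(key):
            self.stores[node][key] = value

    def get(self, key):
        # Read from the first replica that holds the key.
        for node in self.owners(key):
            if key in self.stores[node]:
                return self.stores[node][key]
        return None

if __name__ == "__main__":
    ring = TinyRing(["node-a", "node-b", "node-c"], replicas=2)
    ring.put("user:1", {"name": "Ada"})
    print(ring.owners("user:1"), ring.get("user:1"))
```

Because each key lives on more than one node, a read can still succeed when one replica's copy is missing, which is the availability side of the trade-off the CAP theorem describes.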

7 Business Analytics Gurus to follow on Twitter. Here are seven analytics pros offering commentary on business analytics and related topics on Twitter. From: Gregory Piatetsky-Shapiro. Just as I was leaving on a skiing vacation last week, I saw this Information Management slideshow on 7 Business Analytics Gurus You Should Be Following on Twitter, and was pleased to see myself included. Here are the 7 Business Analytics gurus (all stats as of Jan 10, 2013): Mike Gualtieri, @mgualtieri, Twitter bio: Forrester Analyst: Big Data, predictive analytics, & emerging technology. Host of TechnoPolitics. Information Management summary: Along with podcasts, blogs and traditional research, Forrester Research Analyst Mike Gualtieri brings his "futurist" bent into the analytics fold with consistent entries and RTs on predictive analytics, big data and more. Vincent Granville, @analyticbridge, Twitter bio: Publisher of the AnalyticBridge newsletter. Chicago, www.gartner.com/AnalystBiography?

cs109/content
