
Career of the Future: Data Scientist [INFOGRAPHIC]

Want a job where the talent is scarce, and likely to remain that way for at least the next five years? Become a data scientist. That, at least, is the conclusion of a global survey of the number-crunching professionals by IT service company EMC. Some 63% of data scientists say the profession is going to be undermanned for the foreseeable future, and half of those see it as a serious shortage. But not all of them will have the capacity to turn that raw data into anything useful. "Data is the new oil," says Andreas Weigend, Head of the Social Data Lab at Stanford and the former Chief Scientist at Amazon, in a statement. Check out the rest of the survey data in the detailed infographic below, and let us know in the comments if this is a career you'd like to pursue.

Enterprise Software Doesn't Have to Suck: Data Analysis Training
I'm training some of my colleagues on Big'ish data analysis this week. Here's how I'm running the class. Would love your ideas to make it better.
After completion of the course, you will be able to:
Understand concepts of data science, related processes, tools, techniques and path to building expertise
Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
Use Excel to do basic analysis and plots
Write and understand R code (data structures, functions, packages, etc.)
Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to dataset)
Plot charts on a dataset using R

Good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.)
Familiarity with Unix OS

CLASS TOPICS
A) Intro to data science: Explain data science and its importance.
B) Steps in data science
C) Skills needed for data science
D) Learning R: We will pick a tool to learn the concepts of data science.
Tutorials Books TBD
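
The basic statistics the class lists (min, max, avg, sd, variance, quantiles/deciles) can be sketched in a few lines. The class itself uses R; the version below uses Python's standard library purely as an illustration, and the `summarize` helper is a hypothetical name, not something from the course.

```python
import statistics


def summarize(values):
    """Return the basic summary statistics for a list of numbers.

    Assumption: sample (n - 1) variance and standard deviation are wanted,
    which is what statistics.variance and statistics.stdev compute.
    """
    return {
        "min": min(values),
        "max": max(values),
        "avg": statistics.mean(values),
        "sd": statistics.stdev(values),
        "variance": statistics.variance(values),
        "median": statistics.median(values),
        # n=10 yields the 9 cut points that split the data into deciles.
        "deciles": statistics.quantiles(values, n=10),
    }
```

Calling `summarize` on a numeric column gives a quick first look at a new dataset before plotting it.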

3 New Tools for Promoting Your Interests
The Spark of Genius Series highlights a unique feature of startups and is made possible by Microsoft BizSpark. If you would like to have your startup considered for inclusion, please see the details here. Each weekend, Mashable selects startups we think are building interesting, unique or niche products. This week we focused on three companies creating new ways to promote your interests. SeeJoeRock is a network for musicians and music industry professionals to connect.
SeeJoeRock: A Musician's Free Community
Quick Pitch: SeeJoeRock is a free community for musicians.
Genius Idea: Bridges the gap between unsigned musicians and professionals in the music industry.
Mashable's Take: Described as the "eHarmony for musicians," SeeJoeRock is a network for unsigned musicians and music industry professionals to connect. Unsigned, independent musicians and bands can create their own profiles and include bios, genre, talents, levels of experience and instruments played.
Image courtesy of Flickr, JPott

R, the Software, Finds Fans in Data Analysts
Left, Stuart Isett for The New York Times; right, Kieran Scott for The New York Times. R first appeared in 1996, when the statistics professors Robert Gentleman, left, and Ross Ihaka released the code as a free software package.
R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. But R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use. “R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. It is also free. R is similar to other programming languages, like C, Java and Perl, in that it helps people perform a wide variety of computing tasks by giving them access to various commands. Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially.

Career Advice: How do I become a data scientist

Citelighter Is Like a Highlighter for the Internet
Name: Citelighter
Quick Pitch: Citelighter keeps information from multiple web sources in one place.
Genius Idea: Making it easy to compile notes, and citations, from multiple web pages.
For better or worse, students use the Internet for research. Citelighter attempts to solve this conundrum by giving students a note-collecting toolbar that sticks with them as they navigate the web. Through marketing partnerships with well-targeted websites such as CollegeHumor and Frat Music, the company says it has signed up students at 1,000 different universities since it launched in August 2010. In the current version, users are restricted to collecting information on websites through a Firefox extension. Citelighter is simple and easy to use.

BI at large scale
As more and more data is collected from pretty much everything a user does, such as transaction activities, social interactions and information searches, enterprises have been actively looking into ways to turn this vast amount of raw data into useful information.
BI process flow
It includes the following stages of processing ...
On the other hand, massively parallel processing platforms such as Hadoop and Map/Reduce have, over the last few years, proven themselves in processing terabyte or even petabyte ranges of data.
Approach 1: Apache Mahout
One approach is to "re-implement" the ML algorithm in Map/Reduce, and this is the path of the Apache Mahout project.
Approach 2: Ensemble of parallel independent learners
This is an alternative path that doesn't require re-implementation of existing algorithms. I also found that this approach can smoothly fade out outdated models. One gotcha of the sampling approach is the handling of rare events, since you may lose those rare events in sampling.
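
The ensemble idea can be sketched as follows: split the data into partitions, train an unmodified learner on each one independently, and combine their predictions. This is only an illustration under stated assumptions. The excerpt does not say which learner is used or how predictions are combined, so the nearest-centroid learner, the majority vote and all function names here are hypothetical stand-ins. Threads are used only to show that the training jobs do not depend on each other.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def train_centroids(rows):
    """Fit a nearest-centroid model on one partition.

    rows is a list of (feature_tuple, label) pairs; the model maps each
    label to the mean feature vector of that label's rows.
    """
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict_one(model, features):
    """Return the label whose centroid is closest to features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


def train_ensemble(rows, n_partitions=4):
    """Train one independent model per partition (assumed round-robin split)."""
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(train_centroids, partitions))


def predict(models, features):
    """Combine the independent learners by majority vote (an assumption)."""
    votes = Counter(predict_one(model, features) for model in models)
    return votes.most_common(1)[0][0]
```

Because each model only sees its own partition, a class that is rare in the full dataset may be missing from some partitions entirely, which is the rare-event gotcha mentioned above. Dropping the oldest model from the list is one possible way to retire an outdated model.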

Syllabus: stats202 1.0 documentation
Announcements
Dec 11, 2013: The final exam grades are available on Coursework. The solutions to the final exam are now available here.
Dec 7, 2013: Grade statistics are now available here.
Dec 3, 2013: The Kaggle deadline has changed to Friday, December 6 on the website.
Nov 4, 2013: Your gradebook is now available in our Coursework site.
Oct 31, 2013: You may download the solutions to the midterm.
Oct 13, 2013: Please send all regrade requests to the graders at stats202-aut1314-graders@lists.stanford.edu. Your full name and SUNet ID. The homework and problem number. The number of points that you lost. A brief justification of why you think the grading is incorrect or unfair.
Oct 7, 2013: Both exams will be closed-book and closed-notes.
Sep 30, 2013: If you have questions about homework or any of the lectures, please use our Piazza forum. Any other questions can be emailed to the staff mailing list:
Stats 202 meets MWF 1:15-2:05 pm at Skilling Auditorium (note the location change!).
Course description

Use YouTube As a Music Player With Tubalr Chances are you've already used YouTube as a music player. But how much fun was that? You had to return to the page every time a song ended, search for the next one and then load often-crude comments and clutter along with your song. A 23-year-old software developer in Atlanta has fixed the YouTube listening experience with a simple app called Tubalr. It searches YouTube for the top songs from a particular artist and arranges them in a continuous playlist. If you'd like to mix it up, a "similar" option searches related artists on Last.fm and delivers their top videos on YouTube to your playlist. "I was surfing YouTube and found some amazing HD music videos," says creator Cody Stewart, "and I thought it would be a cool idea to play those back to back without having all the other stuff I didn't find interesting — mainly the 10,000s of comments about cats and dogs." Stewart, neither a surfer nor a Tuba player, created the app in order to show off his skills during a job search.

big data
Over the last couple of years, we have seen an emerging data storage mechanism for storing data at large scale. These storage solutions differ quite significantly from the RDBMS model and are also known as NoSQL. Some of the key players include ...
Google BigTable, HBase, Hypertable
Amazon Dynamo, Voldemort, Cassandra, Riak
Redis
CouchDB, MongoDB
These solutions have a number of characteristics in common:
Key value store
Run on a large number of commodity machines
Data are partitioned and replicated among these machines
Relax the data consistency requirement
API model
The underlying data model can be considered as a large Hashtable (key/value store). The basic form of API access is ...
The underlying infrastructure is composed of a large number (hundreds or thousands) of cheap, commoditized, unreliable machines connected through a network.
Data partitioning (Consistent Hashing)
Since the overall hashtable is distributed across many VNs (virtual nodes), we need a way to map each key to the corresponding VN.
Data replication
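
The key-to-node mapping that consistent hashing provides can be sketched with a small hash ring. This is a minimal illustration, not the implementation used by any of the stores named above: the `HashRing` class, the choice of MD5, the node names and the number of ring points per node are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to an integer position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring mapping keys to node names."""

    def __init__(self, nodes, points_per_node=100):
        self._points_per_node = points_per_node
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed at several positions to even out the load.
        for i in range(self._points_per_node):
            position = ring_position(f"{node}#{i}")
            self._owner[position] = node
            bisect.insort(self._positions, position)

    def remove_node(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first ring position after the key's hash."""
        if not self._positions:
            raise ValueError("ring has no nodes")
        index = bisect.bisect_right(self._positions, ring_position(key))
        # Wrap around to the start of the ring when the key hashes past the end.
        return self._owner[self._positions[index % len(self._positions)]]
```

The point of the ring layout is that removing a node only remaps the keys that node owned; keys owned by the remaining nodes keep their placement, which a plain `hash(key) % node_count` scheme does not guarantee.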

7 Business Analytics Gurus to follow on Twitter Here are seven analytic pros offering commentary on business analytics and related topics on Twitter. From: Gregory Piatetsky-Shapiro Just as I was leaving on a skiing vacation last week, I saw this Information Management slideshow on 7 Business Analytics Gurus You Should Be Following on Twitter, and was pleased to see me included. Here are the 7 Business Analytics gurus (all stats as of Jan 10, 2013): Mike Gualtieri, @mgualtieri, Twitter bio: Forrester Analyst: Big Data, predictive analytics, & emerging technology. Host of TechnoPolitics. Futurist. Information Management summary: Along with podcasts, blogs and traditional research, Forrester Research Analyst Mike Gualtieri brings his "futurist" bent into the analytics fold with consistent entries and RTs on predictive analytics, big data and more. Vincent Granville @analyticbridge Twitter bio: Publisher of the AnalyticBridge newsletter. Chicago, www.gartner.com/AnalystBiography?

How To Find That 1 Thing You Lost Online Argh! What was that video called? Was that on Twitter or Facebook? Where did I save that article? Who was it who made that joke about the Edsel? We've got inboxes over here, inboxes over there, boards here, there, tweets, docs, posts and shares. Greplin: For Finding Your Stuff Greplin is the way I find that one online thing I'm looking for. It can search Gmail, Google Calendar, Google Docs, Google Reader and Google Contacts (as well as the professional Google Apps versions). Some of them you have to unlock by inviting friends. Here's Greplin in action: Yes, you're reading that right. Greplin's premium service is $4.99 a month or $49.99 a year. What About Sensitive Stuff Like Logins & Passwords? User names, passwords, ID and credit card numbers are hard to remember, too, and we need to use them often online. Today I found out about Dashlane, which will do just that. I've taken it for a spin. Dashlane is not quite open to the public, but here's a link for RWW fans to get it now!

The Life of Links: An Interview With the Maker of Kippt The word "bookmark," referring to a saved Web link, is starting to sound old. "Bookmark" has this connotation of turn-of-the-century Web browsers, when there weren't Web-based services for saving things. Your local bookmarks folder was where you kept links you wanted to go back to. These days, we're browsing on multiple devices, and links aren't necessarily "sites," "pages" or "articles" anymore. Links can point to all kinds of things. ReadWriteWeb: How did you decide on the features of Kippt, and how do you distinguish it from other bookmarking services? Jori Lallo: "We didn't actually plan to build a bookmarking service. "We both bought iPads right when they came out, both me and [Kippt designer] Karri [Saarinen]. "It got pretty okay traction for a hack project. Beyond the Chore of Tagging "We both had been opposed to the traditional tagging. "I've found that just plain folders actually work pretty well. What's Wrong With Bookmarking "My girlfriend actually uses Kippt in this way.
