Using Excel to do precision journalism: an update from the School of Data Journalism in Perugia. Our first workshop has just kicked off with Steve Doig leading “Excel for Journalists”. If you missed it, don’t worry – here’s the breakdown for you! Download the Data and the Tutorial: You can download the full data for this tutorial and a text version of the tutorial itself via Steve’s website. The Tutorial: A tutorial by Steve Doig, journalism professor at ASU's Cronkite School and Pulitzer-winning data journalist, based on his workshop, Excel for Journalists. Microsoft Excel is a powerful tool that handles most of the tasks a journalist needs when analyzing data to discover interesting patterns. Topics covered: sorting; filtering; using math and text functions; pivot tables. Introduction to Excel: Excel handles large amounts of data organized in table form, with rows and columns. Modern versions of Excel will hold as many as 1,048,576 records with as many as 16,384 variables! Sorting: One of the most useful abilities of Excel is to sort the data into a more revealing order. Filtering
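The two operations the tutorial starts with, sorting and filtering, can also be sketched outside a spreadsheet. The snippet below is only an illustration in Python, not part of Steve Doig's tutorial; the column names and every value in `rows` are made up for the example.

```python
# Hypothetical records standing in for the rows of a spreadsheet.
# The column names and all values are invented for illustration.
rows = [
    {"place": "Town A", "year": 2012, "incidents": 41},
    {"place": "Town B", "year": 2012, "incidents": 310},
    {"place": "Town C", "year": 2012, "incidents": 275},
    {"place": "Town D", "year": 2011, "incidents": 198},
]

# Sort: largest values first, like a descending sort on one column.
by_incidents = sorted(rows, key=lambda r: r["incidents"], reverse=True)

# Filter: keep only the rows that match a condition, like a column filter.
only_2012 = [r for r in rows if r["year"] == 2012]

for r in by_incidents:
    print(r["place"], r["incidents"])
```

Sorting puts the extreme values at the top where they are easy to spot; filtering hides everything that does not match the condition, without changing the underlying rows.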
66 job interview questions for data scientists. We are now at 91 questions. We've also added 50 new ones here, and started to provide answers to these questions here. These are mostly open-ended questions, meant to assess the breadth of technical knowledge of a senior candidate for a rather high-level position, e.g. director. What is the biggest data set that you have processed, how did you process it, and what were the results?
Data Science Wars: Python vs. R As I frequently travel in data science circles, I’m hearing more and more about a new kind of tech war: Python vs. R. I’ve lived through many tech wars in the past, e.g. Windows vs. Linux, iPhone vs. While R has traditionally been the programming language of choice for data scientists, some believe it is ceding ground to Python. R is Too Complex The most frequently stated argument I’ve heard is the view that Python is general purpose and comparatively easy to learn whereas R remains a somewhat complex programming environment to master. When I first learned R, I did not find it particularly complex; it was a lot easier for me to learn R than C++ or Java with their mammoth frameworks. R Isn’t Really a Language Another argument says that part of the reason people struggle to learn R is that it’s not really a language. Python is More Approachable Some feel that Python is more approachable. Remember, R is a very old statistical environment that has an incredible global following.
Scraping websites using the Scraper extension for Chrome. If you are using Google Chrome, there is a browser extension for scraping web pages. It’s called “Scraper” and it is easy to use. It will help you scrape a website’s content and upload the results to Google Docs. Walkthrough: Scraping a website with the Scraper extension. Open Google Chrome and click on Chrome Web Store. Search for “Scraper” in extensions. The first search result is the “Scraper” extension. Click the “Add to Chrome” button. Now let’s go back to the listing of UK MPs. Open the listing and mark the entry for one MP. Right-click and select “Scrape similar…”. A new window will appear: the scraper console. In the scraper console you will see the scraped content. Click on “Save to Google Docs…” to save the scraped content as a Google Spreadsheet. Walkthrough: extended scraping with the Scraper extension. Note: Before beginning this recipe, you may find it useful to understand a bit about HTML. Easy, wasn’t it?
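For readers who want to see roughly what “scrape similar” amounts to, here is a minimal sketch in Python using only the standard library. It is not how the Scraper extension is implemented; it simply collects every table row that shares the same structure. The sample HTML, the names in it, and the `RowScraper` and `scrape_rows` helpers are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample page: a small table standing in for a listing of MPs.
SAMPLE_HTML = """
<table>
  <tr><td>Jane Example</td><td>Party A</td><td>Northtown</td></tr>
  <tr><td>John Sample</td><td>Party B</td><td>Southville</td></tr>
</table>
"""


class RowScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Return a list of rows, each row a list of cell strings."""
    parser = RowScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each inner list corresponds to one line of the spreadsheet you would get from “Save to Google Docs…”. Real pages are messier than this sample, which is why a little HTML knowledge helps before attempting the extended recipe.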
How to Interview a Data Scientist
Detecting multicollinearity using variance inflation factors | STAT 501 - Regression Methods. Okay, now that we know the effects that multicollinearity can have on our regression analyses and subsequent conclusions, how do we tell when it exists? That is, how can we tell if multicollinearity is present in our data? Some of the common methods used for detecting multicollinearity include: The analysis exhibits the signs of multicollinearity, such as estimates of the coefficients that vary from model to model. Looking at correlations only among pairs of predictors, however, is limiting. What is a variance inflation factor? As the name suggests, a variance inflation factor (VIF) quantifies how much the variance is inflated. Let's be a little more concrete. It can be shown that the variance of the estimated coefficient bk is: Note that we add the subscript "min" in order to denote that it is the smallest the variance can be. Let's consider such a model with correlated predictors: How much larger? An example: the matrix plot of BP, Dur, Pulse, and Stress:
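The standard definition of the variance inflation factor for predictor k is VIF_k = 1 / (1 - R_k^2), where R_k^2 is the R-squared obtained by regressing predictor k on all the other predictors. The sketch below computes it in plain Python under that definition. The helper names (`solve`, `r_squared`, `vif`) and the sample values are assumptions made for illustration; they are not taken from the STAT 501 page or its BP data set.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, xs):
    """R-squared of a least-squares fit of y on the columns in xs, with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in xs] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * design[i][j] for j in range(p)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """Return {name: VIF} for a dict mapping predictor names to equal-length lists."""
    result = {}
    for name, column in predictors.items():
        others = [col for other, col in predictors.items() if other != name]
        # Fails with a division by zero if a predictor is an exact
        # linear combination of the others (perfect collinearity).
        result[name] = 1.0 / (1.0 - r_squared(column, others))
    return result


# Made-up values: x1 and x2 move together, x3 is mostly unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
factors = vif(data)
for name, value in factors.items():
    print(name, round(value, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others, and larger values mean more inflation. Unlike a pairwise correlation, the regression behind R_k^2 uses all the other predictors at once, which is why the VIF catches relationships that a correlation matrix can miss.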
@joelmatriche » Le blog de jo. In a cell containing hundreds, if not thousands, of characters, how do you extract the portion of text and the data that interest you? A concrete case and explanations using the STXT and CHERCHE functions (MID and SEARCH in English-language Excel). The concrete case is this: using OpenRefine, I geocoded a series of municipalities on the fly. Their geographic coordinates appear in a new column, but the problem is that they are buried among hundreds of other pieces of information. I could clean these cells and isolate the latitudes and longitudes with Refine, but I chose to do it with Excel. Here is what the cells look like in OpenRefine: There are digits, there are letters, and the number of characters and their arrangement differ from one cell to the next; it's a mess. The goal is therefore to isolate these latitudes and longitudes in separate cells so that, for example, these municipalities can be geocoded easily. First step.
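The idea behind combining CHERCHE and STXT is to find the position of a marker and then cut out the characters that follow it. Here is a rough Python equivalent of that idea. The cell content and its `lat:` / `lng:` markers are hypothetical, since the actual OpenRefine output is not reproduced in the text, and `text_after` is a helper invented for this sketch.

```python
def text_after(cell, marker, stop=","):
    """Return the text that follows `marker` in `cell`, up to the next `stop` character."""
    start = cell.find(marker)       # like CHERCHE/SEARCH: locate the marker
    if start == -1:
        return None                 # marker absent from this cell
    start += len(marker)
    end = cell.find(stop, start)
    if end == -1:
        end = len(cell)             # no stop character: take the rest of the cell
    return cell[start:end].strip()  # like STXT/MID: cut out the slice


# Hypothetical cell content, invented for illustration.
cell = "name: Exampleville, type: locality, lat: 50.4542, lng: 3.9523, source: demo"

latitude = float(text_after(cell, "lat:"))
longitude = float(text_after(cell, "lng:"))
print(latitude, longitude)
```

Because the slice is anchored on the marker and not on a fixed position, it keeps working when the number of characters before the coordinates changes from one cell to the next. One difference to keep in mind: `str.find` is case-sensitive, whereas Excel's SEARCH is not.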
Getting Started with Python for Data Scientists. With the R Users DC Meetup broadening its topic base to include other statistical programming tools, it seemed only reasonable to write a meta post highlighting some of the best Python tutorials and resources available for data science and statistics. What you don’t know is often the hardest part of picking up a new skill, so hopefully these resources will help make learning Python a little easier. Prepare yourself for code indentation heaven. Python is such an incredible language because it can do practically anything, from high-performance scientific computing to web frameworks such as Django or Flask. Distributions: Python is available for free, and there are two popular versions, 2.7 and 3.x. Commercial distributions that bundle and test various useful packages are also available, such as the Enthought Python Distribution. Python Developer Tools: Sublime Text 2 - If you have never used it, you should try this editor. Learning Python. Learn about Packages.