
Readings


Palantir, the War on Terror's Secret Weapon. In October, a foreign national named Mike Fikri purchased a one-way plane ticket from Cairo to Miami, where he rented a condo. Over the previous few weeks, he’d made a number of large withdrawals from a Russian bank account and placed repeated calls to a few people in Syria. More recently, he rented a truck, drove to Orlando, and visited Walt Disney World by himself. As numerous security videos indicate, he did not frolic at the happiest place on earth. He spent his day taking pictures of crowded plazas and gate areas. None of Fikri’s individual actions would raise suspicions. Lots of people rent trucks or have relations in Syria, and no doubt there are harmless eccentrics out there fascinated by amusement park infrastructure. The day Fikri drives to Orlando, he gets a speeding ticket, which triggers an alert in the CIA’s Palantir system. As the CIA analyst starts poking around on Fikri’s file inside of Palantir, a story emerges.

MapReduce: A Flexible Data Processing Tool. By Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, Vol. 53 No. 1 (January 2010), Pages 72-77. MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations.
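The map/reduce contract described above can be sketched in miniature. This is an illustrative single-process Python sketch of the programming model only (the function names and the word-count task are my own choices), not Google's or Hadoop's implementation:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map: an input (key, value) pair yields intermediate (key, value) pairs
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: merge all intermediate values that share the same key
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # "shuffle": group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce([("d1", "to be or not to be")], map_fn, reduce_fn)
# counts["to"] == 2, counts["be"] == 2, counts["or"] == 1
```

The real system distributes the map and reduce calls across machines and handles partitioning and fault tolerance; the contract between user code and the framework is the same.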

The Dangers of Overfitting or How to Drop 50 spots in 1 minute. This post was originally published on Gregory Park's blog . Reprinted with permission from the author (thanks Gregory!) Over the last month and a half, the Online Privacy Foundation hosted a Kaggle competition, in which competitors attempted to predict psychopathy scores based on abstracted Twitter activity from a couple thousand users.

One of the goals of the competition is to determine how much information about one’s personality can be extracted from Twitter, and by hosting the competition on Kaggle, the Online Privacy Foundation can sit back and watch competitors squeeze every bit of predictive ability out of the data, trying to predict the psychopathy scores of 1,172 Twitter users. Competitors can submit two sets of predictions each day, and each submission is scored from 0 (worst) to 1 (best) using a metric known as “average precision”.

Essentially, a submission that predicts the correct ranking of psychopathy scores across all Twitter accounts will receive a score of 1.

CS345A: Data Mining. Course Info | Handouts | Assignments | Project | Course Outline | Resources and Reading. Room: 200-002, the big auditorium in the basement of the History Corner. It seats 163, so there should be plenty of room for us to spread out. Instructors: Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman (ullman @ gmail dt com). TA: Anish Johnson (ajohna @ stanford dt edu). Staff Mailing List: cs345a-win0809-staff@mailman.stanford.edu. Meeting: MW 4:15-5:30PM; Room: History Corner basement 200-002.
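The “average precision” metric mentioned above can be sketched for the common ranked-list, binary-relevance case. This is one standard definition; the competition's exact variant may differ:

```python
def average_precision(ranked_relevance):
    """Average precision for a ranked list of 0/1 relevance labels,
    highest-ranked item first: the mean of precision-at-k over the
    ranks k where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision at this rank
    return precision_sum / hits if hits else 0.0

# A perfect ranking scores 1.0; pushing relevant items down lowers the score.
print(average_precision([1, 1, 0, 0]))  # 1.0
print(average_precision([0, 1, 0, 1]))  # (1/2 + 2/4) / 2 = 0.5
```

This is why a submission that ranks every account correctly gets a score of 1: every relevant item is ahead of every irrelevant one, so precision is perfect at each relevant rank.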

Office Hours: Anand Rajaraman, MW 5:30-6:30pm (after class, in the same room); Jeff Ullman, 2-4PM on the days I teach, in 433 Gates. Prerequisites: CS145 or equivalent. Materials: There is no text. Students will use the Gradiance automated homework system, for which a fee will be charged. You can see earlier versions of the notes and slides covering Data Mining. There will be assignments of two kinds. Gradiance Assignments: Some of the homework will be on the Gradiance system.

How Vertica Was the Star of the Obama Campaign, and Other Revelations | CITO Research.

The 2012 Obama re-election campaign has important implications for organizations that want to make better use of big data. The hype about its use of data is certainly justified, but a lesser-noticed aspect of the campaign ran against a different kind of data hype we’ve all heard: the Silicon Valley claims that assign Hadoop an unreasonably large role.

One of the most critical contributors to the Obama campaign’s success was the direct access it had to a massive database of voter data stored in Vertica. The Obama campaign did have Hadoop running in the background, doing the noble work of aggregating huge amounts of data, but the biggest win came from good old SQL on a Vertica data warehouse, and from giving dozens of analytics staffers access to the data so they could follow their own curiosity and distill and analyze it as they needed. Their story contains lessons about how organized access to a central database can be a powerful marketing force.

The Reality Club: THE END OF THEORY. There's a dawning sense that extremely large databases of information, starting at the petabyte level, could change how we learn things. The traditional way of doing science entails constructing a hypothesis to match observed data or to solicit new data. Here's a bunch of observations; what theory explains the data sufficiently well that we can predict the next observation?

It may turn out that tremendously large volumes of data are sufficient to skip the theory part and still make a predicted observation. Google was one of the first to notice this. For instance, take Google's spell checker. When you misspell a word while googling, Google suggests the proper spelling. How does it know this? It has seen, across billions of queries, which spellings users type and which corrections they accept. In fact, Google uses the same philosophy of learning via massive data for its translation programs. Once you have such a translation system tweaked, it can translate from any language to another.
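The "no theory, just data" flavor of spell correction can be sketched with word frequencies alone: among candidate words one edit away from the input, pick the one seen most often in a corpus. This is an illustrative toy with a made-up corpus, not Google's actual system, which draws on query logs at vastly larger scale:

```python
from collections import Counter

# Toy corpus standing in for web-scale query logs.
corpus = "the quick brown fox jumps over the lazy dog the fox".split()
freq = Counter(corpus)

def edits1(word):
    # All strings one edit away: deletions, swaps, substitutions, insertions.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    subs = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + subs + inserts)

def suggest(word):
    # Keep a known word; otherwise return the most frequent known word
    # within one edit, falling back to the input itself.
    if word in freq:
        return word
    candidates = edits1(word) & freq.keys()
    return max(candidates, key=freq.get) if candidates else word

print(suggest("teh"))  # "the": a swap away, and the most frequent candidate
```

No model of English morphology or keyboard errors appears anywhere; the "theory" is replaced entirely by counts, which is the point the excerpt is making.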

Chris Anderson is exploring the idea that perhaps you could do science without having theories.

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.

Deja VVVu: Others Claiming Gartner’s Construct for Big Data. In the late 1990s, while a META Group analyst (note: META is now part of Gartner), it was becoming evident that our clients increasingly were encumbered by their data assets. While many analysts were talking about these fast-growing data stores, many clients were lamenting them, and many vendors were seizing the opportunity, I also realized that something else was going on. Sea changes in the speed at which data was flowing, mainly due to electronic commerce, along with the increasing breadth of data sources, structures, and formats due to the post-Y2K ERP application boom, were as challenging to data management teams as the increasing quantity of data, or more so.

In an attempt to help our clients recognize and, more importantly, deal with these challenges, I began speaking at industry conferences on this three-dimensional data challenge of increasing data volume, velocity, and variety. Date: 6 February 2001. Author: Doug Laney. Data Volume, Data Velocity, Data Variety.

What is data science? We’ve all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O’Reilly said that “data is the next Intel Inside.” But what does that statement mean? Why do we suddenly care about statistics and about data? In this post, I examine the many sides of data science — the technologies, the companies, and the unique skill sets. The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application.

One of the earlier data products on the Web was the CDDB database. Google is a master at creating data products. Google’s breakthrough was realizing that a search engine could use input other than the text on the page. Flu Trends: Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control by analyzing searches that people were making in different regions of the country. Google isn’t the only company that knows how to use data. Where data comes from: a 1956 disk drive.

The Data Science Venn Diagram — Drew Conway. On Monday I—humbly—joined a group of NYC's most sophisticated thinkers on all things data for a half-day unconference to help O'Reilly organize their upcoming Strata conference. The breakout sessions were fantastic, and the number of people in each allowed for outstanding, expert-driven discussions.

One of the best sessions I attended focused on issues related to teaching data science, which inevitably led to a discussion of the skills needed to be a fully competent data scientist. As I have said before, I think the term "data science" is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps, and where data science fits.

L'Aquila quake: Italy scientists guilty of manslaughter. 22 October 2012, last updated 15:06 ET. The BBC's Alan Johnston in Rome says the prosecution argued that the scientists were "just too reassuring". Six Italian scientists and an ex-government official have been sentenced to six years in prison over the 2009 deadly earthquake in L'Aquila.

A regional court found them guilty of multiple manslaughter. Prosecutors said the defendants gave a falsely reassuring statement before the quake, while the defence maintained there was no way to predict major quakes. The 6.3 magnitude quake devastated the city and killed 309 people. Many smaller tremors had rattled the area in the months before the quake that destroyed much of the historic centre.

It took Judge Marco Billi slightly more than four hours to reach the verdict in the trial, which had begun in September 2011. Lawyers have said that they will appeal against the sentence. 'Alarming' case: "Sadly, the issue is not 'if' but 'when' the next tremor will occur in L'Aquila."

Flu Trends | United States.

Google Flu Trends Wildly Overestimated This Year's Flu Outbreak - David Wagner. Scientific hindsight shows that Google Flu Trends far overstated this year's flu season, raising questions about the accuracy of using a search engine, hyped by Google and the media as an efficient public health tool, to monitor the flu.

Nature's Declan Butler reported today on the huge discrepancy between Google Flu Trends' estimated peak flu levels and data collected by the U.S. Centers for Disease Control and Prevention (CDC) earlier this winter. Google bases its numbers on flu-related searches (the basic idea being that more people googling terms like "flu symptoms" means more people catching viruses). The CDC, on the other hand, uses traditional epidemiological surveillance methods. Past results have shown Google to have a pretty good track record of mirroring CDC flu charts. There's no doubt that this year's flu season was severe.

Lots of media attention to this year's flu season skewed Google's search engine traffic. This wasn't supposed to happen.

The Expression of Emotions in 20th Century Books. We report here trends in the usage of “mood” words, that is, words carrying emotional content, in 20th century English language books, using the data set provided by Google that includes word frequencies in roughly 4% of all books published up to the year 2008. We find evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. Finally, we show that, in books, American English has become decidedly more “emotional” than British English in the last half-century, as part of a more general increase in the stylistic divergence between the two variants of the English language. Citation: Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030. Editor: Sune Lehmann, Technical University of Denmark. Received: July 26, 2012; Accepted: February 11, 2013; Published: March 20, 2013.
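The paper's basic measurement, the yearly frequency of mood words relative to all words, can be sketched on toy data. The counts below are invented for illustration, and the paper's actual pipeline additionally normalizes against random word lists and works with z-scores:

```python
# Hypothetical per-year word counts standing in for the Google Books data.
yearly_counts = {
    1950: {"joy": 120, "grief": 80, "the": 50_000},
    1990: {"joy": 90, "grief": 60, "the": 70_000},
}
mood_words = {"joy", "grief"}

def mood_score(counts, mood_words):
    # Frequency of mood words relative to all counted words in that year.
    total = sum(counts.values())
    mood = sum(counts.get(w, 0) for w in mood_words)
    return mood / total

for year in sorted(yearly_counts):
    print(year, round(mood_score(yearly_counts[year], mood_words), 5))
```

With these made-up numbers the score falls from 1950 to 1990, mirroring in shape (not in data) the general decrease in emotion-related words the paper reports.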

Latent Semantic Analysis in Python. Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships. An example of LSA: using a search engine, search for “sand”. Documents are returned which do not contain the search term “sand” but do contain terms like “beach”. LSA has identified a latent relationship: “sand” is semantically close to “beach”. There are some very good papers which describe LSA in detail. This is an implementation of LSA in Python (2.4+). 1. Create the term-document matrix: we use the previous work in Vector Space Search to build this matrix. 2. tf-idf transform: apply the tf-idf transform to the term-document matrix. 3. Singular value decomposition: determine U, Sigma, and VT from the matrix produced in the previous steps.
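The steps above can be sketched with numpy on a tiny invented term-document matrix. This is a minimal sketch of the technique, not the post's code: the example vocabulary is made up, and tf-idf has several variants (one common one is used here):

```python
import numpy as np

# Step 1: toy term-document matrix, rows = terms, columns = documents.
terms = ["sand", "beach", "sun", "code"]
A = np.array([
    [1.0, 0.0, 1.0],   # "sand"
    [1.0, 1.0, 0.0],   # "beach"
    [0.0, 1.0, 1.0],   # "sun"
    [0.0, 0.0, 1.0],   # "code"
])

# Step 2: tf-idf transform (one common variant).
tf = A / A.sum(axis=0, keepdims=True)          # term frequency per document
df = (A > 0).sum(axis=1, keepdims=True)        # document frequency per term
idf = np.log(A.shape[1] / df)
X = tf * idf

# Step 3: SVD, then keep only the top-k singular values (rank-k approximation).
U, sigma, VT = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]
# X_k is the low-rank "semantic" reconstruction; comparing its rows (terms)
# or columns (documents) surfaces the latent relationships LSA is after.
```

Truncating to rank k is what lets "sand" and "beach" end up close even when they never co-occur in the same document.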

Finally, we calculate the reduced-rank matrix from the truncated SVD.

Dan McKinley :: Whom the Gods Would Destroy, They First Give Real-time Analytics. Homer: "There's three ways to do things. The right way, the wrong way, and the Max Power way!" Bart: "Isn't that the wrong way?" Homer: "Yeah. But faster!" Every few months, I try to talk someone down from building a real-time product analytics system. The turnaround time for most of the web analysis done at Etsy is at least 24 hours. Here's an excerpt from a manifesto demanding the construction of such a system: "We believe in..." The 23-year-old programmer inside of me is salivating at the idea of building this. But the 33-year-old programmer (who has long since beaten those demons into a bloody submission) sees the difficulty as irrelevant at best.

Engineers might find this assertion more puzzling than most. This line of thinking is a trap. So what is it that makes product analysis different? The first and most fundamental mistake is to disregard statistical significance testing entirely. One could certainly have a real-time analytics system without making any of these mistakes.

Pat Hanrahan - Tools for Data Enthusiasts.

The Joy of Stats. About the video: Hans Rosling says there's nothing boring about stats, and then goes on to prove it. A one-hour documentary produced by Wingspan Productions and broadcast by the BBC, 2010. A DVD is available to order from Wingspan Productions. Director & Producer: Dan Hillman; Executive Producer: Archie Baron.

©Wingspan Productions for BBC, 2010. The change from large to small families reflects dramatic changes in people's lives. Hans Rosling asks: Has the UN gone mad? Hans Rosling explains a very common misunderstanding about the world: that saving the poor children leads to overpopulation.

IEEE Conference on Data Mining.