Must-Have R Packages for Social Scientists
This happens to be one of those rare instances where the benefit of hindsight does not make me regret something said flippantly on a panel. I deeply believe that in order to truly change the world we cannot simply "throw analytics at the problem." To that end, the medical and health industries are perhaps the most primed to be disrupted by data and analytics. To be successful, however, a deep respect for both the methodological and clinical contexts of the data is required. It is incredibly exciting to be at an organization that is both working within the current framework of health care and data to create new insight for people and pushing the envelope with respect to individuals' relationships with their own health. The challenges are technical, sociological, and political, but the potential for innovation that exists in this space comes along very rarely. I feel lucky to have an opportunity to move into the health data space now.

http://drewconway.com/zia/

CSI Math simpleR Using R for Introductory Statistics
By John Verzani. Version 0.4 (August 22, 2002).

The Afghanistan War Logs Released by Wikileaks, the World's First Stateless News Organization
July 26, 2010. "In media history up to now, the press is free to report on what the powerful wish to keep secret because the laws of a given nation protect it. But Wikileaks is able to report on what the powerful wish to keep secret because the logic of the Internet permits it.

The key word in “Data Science” is not Data, it is Science
One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were "big data" and a question from the audience was: "Well, that isn't really big data is it, it is only X Gigabytes". While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data. They care about volume and velocity and whatever other buzzwords describe data that is too big for you to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into the second category of hype - hype about tools.

Wikileaks Data Spurs App Development - ReadWriteCloud
While politicians, pundits, military, and journalists assess and debate the fallout from Wikileaks' release of the "Afghan War Diary" - the legality and ethics of Wikileaks, its impact on the war efforts, the rise of the "world's first stateless news organization" - a number of developers are diving right into the 91,000 some odd classified documents and seeing what they can do with the data. And it's a substantial chunk of data. The documents dated from 2004 to 2010 are available in HTML, CSV, or SQL formats, as well as several KML files. But even in the HTML format, reading through the Afghan War Diary is no easy task. This is no Stephen Ambrose-presentation of history.

Statistics and the Science Club
One of my favorite movies is Woody Allen’s Annie Hall. If you’re my age and you haven’t seen it, I usually tell people it’s like When Harry Met Sally, except really good. The movie opens with Woody Allen’s character Alvy Singer explaining that he would “never want to belong to any club that would have someone like me for a member”, a quotation he attributes to Groucho Marx (or Freud).

George Soros Open Society Institute CIA Inquiry on Wikileaks
10 August 2010
Subject: FW: Site Submission From Contact Us Form
Date: Tue, 10 Aug 2010 09:31:36 -0400
From: "Amy P. Weil" <AWeil[at]sorosny.org>
To: <cryptome[at]earthlink.net>
Dear John Young, Thank you for your query.

Cooperation between Referees and Authors Increases Peer Review Accuracy
Peer review is fundamentally a cooperative process between scientists in a community who agree to review each other's work in an unbiased fashion. Peer review is the foundation for decisions concerning publication in journals, awarding of grants, and academic promotion. Here we perform a laboratory study of open and closed peer review based on an online game. We show that when reviewer behavior was made public under open review, reviewers were rewarded for refereeing and formed significantly more cooperative interactions (13% increase in cooperation, P = 0.018). We also show that referees and authors who participated in cooperative interactions had an 11% higher reviewing accuracy rate (P = 0.016).

Marc A. Thiessen - WikiLeaks' blow to the surge
WikiLeaks founder Julian Assange has made clear that his objective in releasing tens of thousands of classified documents was to "end the war in Afghanistan" and "oppose an unjust [war] plan before it reaches implementation." He may well achieve his goal. Assange's illegal disclosures are helping the Taliban to undermine Gen.

Determining the number of clusters in a data set
Determining the number of clusters in a data set, a quantity often labeled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and Expectation-maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS algorithm do not require the specification of this parameter; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user.
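To make the role of k in the clustering excerpt concrete, here is a minimal Python sketch of one common heuristic, the "elbow" approach: run k-means for several values of k and watch where the within-cluster sum of squares (WCSS) stops dropping sharply. This is an illustration under stated assumptions, not the method the excerpt itself describes. The toy data, the farthest-first initialization, and the function names (dist2, farthest_first, kmeans_wcss) are all hypothetical choices made for the example.

```python
from math import fsum


def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        updated = [
            (fsum(x for x, _ in c) / len(c), fsum(y for _, y in c) / len(c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return fsum(min(dist2(p, c) for c in centers) for p in points)


# Toy data (hypothetical): three small, well-separated blobs of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
blob_centers = [(0, 0), (10, 10), (20, 0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On data like this, the WCSS falls steeply up to k = 3 and only slightly afterwards, which is the "elbow" a user would read as a candidate for k. On real data the bend is often much less clear, which is the ambiguity the excerpt points to.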

Open Source Tools Turn WikiLeaks Into Illustrated Afghan Meltdown (Updated)
It’s one thing to read about individual Taliban attacks in WikiLeaks’ trove of war logs. It’s something quite different to see the bombings and the shootings mount, and watch the insurgency metastasize. NYU political science grad student (and occasional Danger Room contributor) Drew Conway has done just that, using an open source statistical programming language called R and a graphical plotting software tool. The results are unnerving, like stop-motion photography of a freeway wreck. Above is the latest example: a graph showing the spread of combat from 2004 to 2009.

The importance of simulating the extremes
Simulation is commonly used by statisticians/data analysts to: (1) estimate variability or improve predictors, (2) evaluate the space of potential outcomes, and (3) evaluate the properties of new algorithms or procedures. Over the last couple of days, discussions of simulation have popped up in a couple of different places. First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions.

‘Afghan Insurgency Can Sustain Itself Indefinitely’: Top U.S. Intel Officer
The Taliban not only has the “momentum” after the most successful year in its campaign against the United States and the Kabul government. “The Afghan insurgency can sustain itself indefinitely,” according to a briefing from Major General Michael Flynn, the top U.S. intelligence officer in the country. “The Taliban retains [the] required partnerships to sustain support, fuel legitimacy and bolster capacity.”
