
Scientific method: Statistical errors

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white. The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01 — usually interpreted as 'very significant'. Publication in a high-impact journal seemed within Motyl's grasp. But then reality intervened: when he tried to replicate the study, the effect vanished and the P value came nowhere near significance. It turned out that the problem was not in the data or in Motyl's analyses, but in the surprisingly slippery nature of the P value itself. For many scientists, this is especially worrying in light of wider concerns about reproducibility. P values presented out of context have always had critics. What does it all mean?
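Motyl's raw data are not given here, so the following is only a hypothetical sketch of the article's point: even with a genuine, modest effect, the P value from an individual experiment swings widely from one replication to the next. The effect size, sample size and choice of test below are all assumptions.

```python
# Hypothetical illustration (not Motyl's data): how much a P value can swing
# between exact replications of the same modest-effect experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3   # assumed standardized effect size
n_per_group = 50    # assumed sample size per group

p_values = []
for _ in range(1000):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_effect, 1.0, n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print(f"share of replications with p < 0.05: {np.mean(p_values < 0.05):.2f}")
print(f"p-value range (5th to 95th percentile): "
      f"{np.percentile(p_values, 5):.3f} to {np.percentile(p_values, 95):.3f}")
```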

The British amateur who debunked the mathematics of happiness | Science | The Observer Nick Brown does not look like your average student. He's 53 for a start and at 6ft 4in with a bushy moustache and an expression that jackknifes between sceptical and alarmed, he is reminiscent of a mid-period John Cleese. He can even sound a bit like the great comedian when he embarks on an extended sardonic riff, which he is prone to do if the subject rouses his intellectual suspicion. A couple of years ago that suspicion began to grow while he sat in a lecture at the University of East London, where he was taking a postgraduate course in applied positive psychology. There was a slide showing a butterfly-shaped graph of the kind produced by the branch of mathematical modelling most often associated with chaos theory. On the graph was a tipping point that claimed to identify the precise emotional co-ordinates that divide those people who "flourish" from those who "languish". According to the graph, it all came down to a specific ratio of positive emotions to negative emotions. It was as simple as that.
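The "butterfly graph" that caught Brown's eye is the attractor of the Lorenz system of differential equations, which the positivity-ratio model he was being shown had borrowed from fluid dynamics. As a minimal sketch only, using standard textbook parameters and with no connection to any emotional data, the system can be integrated like this:

```python
# A minimal Euler integration of the Lorenz system -- the "butterfly" shape.
# Standard textbook parameters are assumed; nothing here models emotions.
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return np.array([x + dx * dt, y + dy * dt, z + dz * dt])

state = np.array([1.0, 1.0, 1.0])
trajectory = [state]
for _ in range(5000):
    state = lorenz_step(state)
    trajectory.append(state)

trajectory = np.array(trajectory)
print("x range of the attractor:", trajectory[:, 0].min(), trajectory[:, 0].max())
```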

LineUp Rankings are a popular and universal approach to structuring otherwise unorganized collections of items by computing a rank for each item based on the value of one or more of its attributes. This allows us, for example, to prioritize tasks or to evaluate the performance of products relative to each other. While the visualization of a ranking itself is straightforward, its interpretation is not, because the rank of an item represents only a summary of a potentially complicated relationship between its attributes and those of the other items. In our paper we present a comprehensive analysis of requirements for the visualization of multi-attribute rankings. Additionally, through integration of slope graphs, LineUp can also be used to compare multiple alternative rankings on the same set of items, for example, over time or across different attribute combinations. Materials: Paper (PDF, 1.4 MB); Talk by Samuel Gratzl at InfoVis '13 (PDF, 3.8 MB; PPTX, 27.8 MB); Supplementary Material (ZIP, 1 MB).
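LineUp's actual scoring model is described in the paper; purely as a sketch of the general idea behind multi-attribute ranking, normalise each attribute, combine the values with user-chosen weights, and sort by the combined score. The item names, attributes and weights below are invented.

```python
# Toy multi-attribute ranking: normalise each attribute to [0, 1], combine with
# user-chosen weights, and sort. Items, attributes and weights are invented;
# see the LineUp paper for the tool's actual scoring model.
items = {
    "product A": {"price": 30.0, "rating": 4.1, "battery_hours": 9.0},
    "product B": {"price": 55.0, "rating": 4.6, "battery_hours": 7.0},
    "product C": {"price": 20.0, "rating": 3.8, "battery_hours": 12.0},
}
weights = {"price": 0.3, "rating": 0.5, "battery_hours": 0.2}
invert = {"price"}  # lower is better for price

def normalised(attr, value):
    values = [it[attr] for it in items.values()]
    lo, hi = min(values), max(values)
    score = (value - lo) / (hi - lo) if hi > lo else 0.5
    return 1.0 - score if attr in invert else score

def combined_score(attrs):
    return sum(weights[a] * normalised(a, v) for a, v in attrs.items())

ranking = sorted(items, key=lambda name: combined_score(items[name]), reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, round(combined_score(items[name]), 3))
```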

Still Not Significant What to do if your p-value is just over the arbitrary threshold for ‘significance’ of p=0.05? You don’t need to play the significance testing game – there are better methods, like quoting the effect size with a confidence interval – but if you do, the rules are simple: the result is either significant or it is not. So if your p-value remains stubbornly higher than 0.05, you should call it ‘non-significant’ and write it up as such. The problem for many authors is that this just isn’t the answer they were looking for: publishing so-called ‘negative results’ is harder than ‘positive results’. The solution is to apply the time-honoured tactic of circumlocution to disguise the non-significant result as something more interesting. As well as being statistically flawed (results are either significant or not and can’t be qualified), the wording is linguistically interesting, often describing an aspect of the result that just doesn’t exist.
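As a concrete version of the "better methods" the post points to, here is a minimal sketch that reports a standardized effect size (Cohen's d) with a bootstrap confidence interval, rather than a bare verdict of significant or non-significant. The two samples and the bootstrap settings are invented for illustration.

```python
# Report an effect size with a confidence interval instead of a bare p-value.
# The two samples are simulated; the bootstrap CI is one of several valid options.
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, 40)
group_b = rng.normal(10.8, 2.0, 40)

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (b.mean() - a.mean()) / pooled_sd

boot = [cohens_d(rng.choice(group_a, group_a.size, replace=True),
                 rng.choice(group_b, group_b.size, replace=True))
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Cohen's d = {cohens_d(group_a, group_b):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```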

Scientific Regress by William A. Wilson The problem with science is that so much of it simply isn't. Last summer, the Open Science Collaboration announced that it had tried to replicate one hundred published psychology experiments sampled from three of the most prestigious journals in the field. Scientific claims rest on the idea that experiments repeated under nearly identical conditions ought to yield approximately the same results, but until very recently, very few had bothered to check in a systematic way whether this was actually the case. The OSC was the biggest attempt yet to check a field's results, and the most shocking. In many cases, the replicators had used the original experimental materials, and sometimes even performed the experiments under the guidance of the original researchers. Of the studies that had originally reported positive results, an astonishing 65 percent failed to show statistical significance on replication, and many of the remainder showed greatly reduced effect sizes.
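The OSC figures above are empirical results; the toy simulation below only illustrates one standard explanation of how such numbers can arise: a modest base rate of true effects, underpowered studies, and publication of significant results only. Every rate and sample size here is an assumption for illustration, not an estimate of psychology as a field.

```python
# Toy model of a literature: some hypotheses are true, studies are underpowered,
# only significant originals are "published", and each gets one replication attempt.
# All rates below are assumptions chosen purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_per_group = 5000, 30
prob_true, effect = 0.3, 0.4   # assumed base rate of true effects and their size

published, replicated = 0, 0
for _ in range(n_studies):
    is_true = rng.random() < prob_true
    delta = effect if is_true else 0.0
    p_orig = stats.ttest_ind(rng.normal(0, 1, n_per_group),
                             rng.normal(delta, 1, n_per_group)).pvalue
    if p_orig >= 0.05:
        continue  # non-significant originals never enter the published record
    published += 1
    p_rep = stats.ttest_ind(rng.normal(0, 1, n_per_group),
                            rng.normal(delta, 1, n_per_group)).pvalue
    replicated += p_rep < 0.05

print(f"published originals: {published}, "
      f"replication success rate: {replicated / published:.2f}")
```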

Is social psychology really in crisis? The headlines: “Disputed results a fresh blow for social psychology”; “Replication studies: Bad copy”. The story: Controversy is simmering in the world of psychology research over claims that many famous effects reported in the literature aren’t reliable, or may even not exist at all. The latest headlines follow the publication of experiments which failed to replicate a landmark study by Dutch psychologist Ap Dijksterhuis. What they actually did: The first of Dijksterhuis' original experiments asked people to think about the typical university professor and list on paper their appearance, lifestyle and behaviours. The experiment found that people who had thought about professors scored 10% higher on a general knowledge quiz than people who hadn’t been primed in this way. How plausible is it? It’s extremely plausible that people are influenced by recent activities and thoughts - the concept of priming is beyond question, having been supported by decades of research.

The Baloney Detection Kit: Carl Sagan’s Rules for Bullshit-Busting and Critical Thinking By Maria Popova Carl Sagan was many things — a cosmic sage, voracious reader, hopeless romantic, and brilliant philosopher. But above all, he endures as our era’s greatest patron saint of reason and common sense, a master of the vital balance between skepticism and openness. In The Demon-Haunted World: Science as a Candle in the Dark (public library) — the same indispensable volume that gave us Sagan’s timeless meditation on science and spirituality, published mere months before his death in 1996 — Sagan shares his secret to upholding the rites of reason, even in the face of society’s most shameless untruths and outrageous propaganda. Through their training, scientists are equipped with what Sagan calls a “baloney detection kit” — a set of cognitive tools and techniques that fortify the mind against penetration by falsehoods: “The kit is brought out as a matter of course whenever new ideas are offered for consideration.” Sagan ends the chapter with a necessary disclaimer.

Use standard deviation (not mad about MAD) Nassim Nicholas Taleb recently wrote an article advocating the abandonment of standard deviation in favour of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models. Let’s suppose we have 2 boxes of 10 lottery tickets: all tickets were purchased for $1 each for the same game in an identical fashion at the same time. Now, since all tickets are identical, if we are making a mere point-prediction (a single number value estimate for each ticket instead of a detailed posterior distribution), then there is an optimal prediction that is a single number V. Suppose we use mean absolute deviation as our measure of model quality.
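The lottery-ticket argument can be made concrete with a small numerical check: under squared error the best single per-ticket prediction is the mean, which reproduces the box's total value, while under absolute error it is the median, which here predicts a total of zero. The ticket values below are invented to mimic the post's setup, not taken from it.

```python
# One box of 10 lottery tickets: nine pay $0, one pays $10 (values invented to
# mimic the post's setup). What single per-ticket prediction V is "best"?
import numpy as np

tickets = np.array([0.0] * 9 + [10.0])
candidates = np.linspace(0.0, 10.0, 1001)

sq_loss = [np.mean((tickets - v) ** 2) for v in candidates]
abs_loss = [np.mean(np.abs(tickets - v)) for v in candidates]

best_sq = candidates[int(np.argmin(sq_loss))]    # the mean: $1 per ticket
best_abs = candidates[int(np.argmin(abs_loss))]  # the median: $0 per ticket
print("squared-error optimum:", best_sq, "-> predicted box total", best_sq * 10)
print("absolute-error optimum:", best_abs, "-> predicted box total", best_abs * 10)
```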

The Conversation Research and creative thinking can change the world. This means that academics have enormous power. But, as academics Asit Biswas and Julian Kirchherr have warned, the overwhelming majority are not shaping today’s public debates. Instead, their work is largely sitting in academic journals that are read almost exclusively by their peers. Up to 1.5 million peer-reviewed articles are published annually. This suggests that a lot of great thinking and many potentially world-altering ideas are not getting into the public domain. Why not? The answer appears to be threefold: a narrow idea of what academics should or shouldn’t do; a lack of incentives from universities or governments; and a lack of training in the art of explaining complex concepts to a lay audience. On the first point, the ‘intellectual mission’: some academics insist that it’s not their job to write for the general public. The counter-argument is that academics can’t operate in isolation from the world’s very real problems.

Data Colada | [19] Fake Data: Mendel vs. Stapel Diederik Stapel, Dirk Smeesters, and Lawrence Sanna published psychology papers with fake data. They each faked in their own idiosyncratic way; nevertheless, their data do share something in common. Real data are noisy. Theirs aren’t. Gregor Mendel’s data also lack noise (yes, famous peas-experimenter Mendel), but Mendel, unlike the psychologists, had a motive. Excessive similarity: to get a sense for what we are talking about, let’s look at the study that first suggested Smeesters was a fabricateur (see the retracted paper .pdf). Results are as predicted. Stapel and Sanna had data with the same problem. How Mendel is like Stapel, Smeesters & Sanna: Mendel famously crossed plants and observed the share of baby-plants with a given trait. Recall how Smeesters’ data had a 27-in-100,000 chance of arising if the data were real? How Mendel is not like Stapel, Smeesters, Sanna: Mendel wanted his data to look like his theory. Imagine Mendel runs an experiment and gets 27% instead of 33% of baby-plants with a trait.
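The post's "excessive similarity" argument can be sketched numerically: simulate many honest binomial replications and ask how often their spread across conditions is as small as the spread actually reported. The counts below are invented, and the check is a simplified stand-in for the authors' actual analysis.

```python
# Rough "excessive similarity" check in the spirit of the Data Colada post:
# how often would honest binomial sampling give condition counts as tightly
# clustered as the ones observed? All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n_per_condition = 100
observed_counts = np.array([33, 34, 33, 32, 34])   # suspiciously similar (made up)
p_hat = observed_counts.mean() / n_per_condition   # pooled estimate of the true rate
observed_spread = observed_counts.std()

simulated_spreads = rng.binomial(n_per_condition, p_hat,
                                 size=(100000, observed_counts.size)).std(axis=1)
p_too_similar = np.mean(simulated_spreads <= observed_spread)
print(f"chance of a spread this small under honest sampling: {p_too_similar:.4f}")
```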

The Measurement of the Thing: Thinking About Metrics, Altmetrics and How to Beat Goodhart’s Law In the early decades of the 20th Century, there was a big problem with the Universe. “Man is the measure of all things: of things which are, that they are, and of things which are not, that they are not” (Protagoras, c. 490-420 BC). Well, not so much the universe, as our ability to measure it. We didn’t know how big it was. In fact we didn’t know whether the ‘nebulae’ visible in our telescopes were of the Milky Way, or collections of multitudinous stars and worlds way beyond our galaxy. In fact, we didn’t know if there was a “beyond our galaxy”. And then we discovered the Cepheid Variable. In 1923, Edwin Hubble used this class of stars to show that the Andromeda Galaxy lay outside the boundaries of our Milky Way. The other thing about standards is that they get used as the measure of achievement, and Goodhart’s law kicks in: any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. “If only there was a better way” goes the cry. I’ll cut to the chase.

Taleb - Deviation The notion of standard deviation has confused hordes of scientists; it is time to retire it from common use and replace it with the more effective one of mean deviation. Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems. There is no scientific reason to use it in statistical investigations in the age of the computer, as it does more harm than good—particularly with the growing class of people in social science mechanistically applying statistical tools to scientific problems. Say someone just asked you to measure the "average daily variations" for the temperature of your town (or for the stock price of a company, or the blood pressure of your uncle) over the past five days. The five changes are: (-23, 7, -3, 20, -1). Do you take every observation: square it, average the total, then take the square root? It all comes from bad terminology for something non-intuitive.
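Taleb's five numbers make the two measures easy to compare directly; the sketch below computes both, using the convention of dividing by n (one convention among several).

```python
# Mean absolute deviation vs. standard deviation for the five daily changes
# quoted above. Population forms (divide by n) are used here; other conventions
# divide by n-1 for the standard deviation.
import numpy as np

changes = np.array([-23.0, 7.0, -3.0, 20.0, -1.0])

mad = np.mean(np.abs(changes - changes.mean()))           # mean absolute deviation
std = np.sqrt(np.mean((changes - changes.mean()) ** 2))   # standard deviation

print(f"mean absolute deviation: {mad:.2f}")   # squaring weights the big swings
print(f"standard deviation:      {std:.2f}")   # more heavily, so this is larger
```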

Karl Popper: What Makes a Theory Scientific It’s not immediately clear, to the layman, what the essential difference is between science and something masquerading as science: pseudoscience. The distinction gets at the core of what comprises human knowledge: How do we actually know something to be true? Is it simply because our powers of observation tell us so? Or is there more to it? Sir Karl Popper, the philosopher of science, was interested in the same problem, and opened his famous lecture on it by recalling: “When I received the list of participants in this course and realized that I had been asked to speak to philosophical colleagues I thought, after some hesitation and consultation, that you would probably prefer me to speak about those problems which interest me most, and about those developments with which I am most intimately acquainted.” Popper saw a problem with the number of theories he considered non-scientific that, on their surface, seemed to have a lot in common with good, hard, rigorous science. What was the missing element?
