An intro to power and sample size estimation -- Jones et al. 20 (5): 453 -- Emergency Medicine Journal Correspondence to: Dr S R Jones, Emergency Department, Manchester Royal Infirmary, Oxford Road, Manchester M13 9WL, UK; firstname.lastname@example.org Abstract The importance of power and sample size estimation for study design and analysis. Understand power and sample size estimation. Power and sample size estimations are measures of how many patients are needed in a study. In previous articles in the series on statistics published in this journal, statistical inference has been used to determine if the results found are true or possibly due to chance alone. Power and sample size estimations are used by researchers to determine how many subjects are needed to answer the research question (or null hypothesis). An example is the case of thrombolysis in acute myocardial infarction (AMI). Generally these trials compared thrombolysis with placebo and often had a primary outcome measure of mortality at a certain number of days.
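As a rough illustration of the kind of calculation this article introduces, the sketch below estimates how many patients per arm would be needed to compare mortality between two groups, using the standard normal-approximation formula for two proportions. The function name, the 12% and 9% mortality figures, and the defaults of a two-sided 5% significance level and 80% power are assumptions chosen for illustration; they are not taken from the article or from any thrombolysis trial.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed PER GROUP to detect p1 vs p2.

    Uses the usual normal-approximation formula for comparing two
    independent proportions with a two-sided test (an assumption;
    the article may use a different method or tables).
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled proportion under H0
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical mortality rates: 12% with placebo, 9% with treatment.
    print(sample_size_two_proportions(0.12, 0.09))
```

Under these assumptions, shrinking the expected difference between the two groups or asking for higher power both increase the required number of patients, which is the trade-off a power calculation makes explicit.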
40 Techniques Used by Data Scientists These techniques cover most of what data scientists and related practitioners are using in their daily activities, whether they use solutions offered by a vendor, or whether they design proprietary tools. When you click on any of the 40 links below, you will find a selection of articles related to the entry in question. Most of these articles are hard to find with a Google search, so in some ways this gives you access to the hidden literature on data science, machine learning, and statistical science. Starred techniques (marked with a *) belong to what I call deep data science, a branch of data science that has little if any overlap with closely related fields such as machine learning, computer science, operations research, mathematics, or statistics. To learn more about deep data science, click here. Also, to discover in which contexts and applications the 40 techniques below are used, I invite you to read the following articles: The 40 data science techniques
R scripts for analyzing survey data Another site pops up with open code for analyzing public survey data: It will be interesting to see whether this gets used by the general public--given the growing trend of data journalism and so forth--versus academics. It is a useful resource for both.
Data Marketplace It takes a lot of data of all kinds to add up to Big Data. That's why we've assembled this awesome collection of datasets and streams for data scientists and developers to sample, experiment with and use to create awesome analytics and applications. Whether you need to tap into social, geo or other kinds of data, we've got just what you need. Tags: age, births, census, character, chemistry, commodities, corpora, death, demographics, economics, employment, football, geonames, government, health, housing, income, language, law, literature, locations, longitude, maps, music, national, pollution, population, science, size-large, social, spending, sports, statistics, survey, twitter, word, zipcode
Video: Survey Package in R Sebastián Duchêne presented a talk at Melbourne R Users on 20th February 2013 on the Survey Package in R. Talk Overview: Complex designs are common in survey data. In practice, collecting random samples from a population is costly and impractical. About the presenter: Sebastián Duchêne is a Ph.D. candidate at The University of Sydney, based at the Molecular Phylogenetics, Ecology, and Evolution Lab. See here for the full list of Melbourne R User Videos. Handy statistical lexicon These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people: Mister P: Multilevel regression and poststratification. The Secret Weapon: Fitting a statistical model repeatedly on several different datasets and then displaying all these estimates together. The Superplot: Line plot of estimates in an interaction, with circles showing group sizes and a line showing the regression of the aggregate averages. The Folk Theorem: When you have computational problems, often there’s a problem with your model. The Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing. Weakly Informative Priors: What you should be doing when you think you want to use noninformative priors. P-values and U-values: They’re different. Conservatism: In statistics, the desire to use methods that have been used before.
Get JSON from Excel using Python, xlrd | Anthony DeBarros Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need. These days, most of the apps I help build at Gannett Digital consume JSON. Simpler apps, such as the table/modal displays we deployed in February for our Oscar Scorecard and Princeton Review Best Value Colleges, run off one or two JSON files. I wrote last year about how to use Python to generate JSON files from a SQL database. The key ingredient is the Python library xlrd. (Another choice is openpyxl, which has similar features and works with newer .xlsx formatted Excel files.) Basic xlrd operations Let’s say we have an Excel workbook containing a small table repeated over three worksheets. From Excel to JSON Pretty cool stuff.
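The Excel-to-JSON step described above can be sketched roughly as follows. This is a minimal sketch, not the author's script: the function names and the assumption that the first row of the worksheet holds column headers are illustrative, since the workbook layout is not shown in this excerpt. It uses only `xlrd.open_workbook`, `sheet_by_index`, `nrows` and `row_values` from xlrd.

```python
import json


def sheet_to_records(sheet):
    """Turn a worksheet into a list of dicts, one per data row.

    Assumes row 0 holds the column headers (an assumption; the
    original workbook layout is not shown in the excerpt).
    """
    headers = [str(h) for h in sheet.row_values(0)]
    return [
        dict(zip(headers, sheet.row_values(row_index)))
        for row_index in range(1, sheet.nrows)
    ]


def workbook_to_json(xls_path, json_path, sheet_index=0):
    """Read one worksheet with xlrd and write it out as a JSON file."""
    import xlrd  # third-party; xlrd 2.x reads only legacy .xls files

    workbook = xlrd.open_workbook(xls_path)
    sheet = workbook.sheet_by_index(sheet_index)
    records = sheet_to_records(sheet)
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return records
```

Calling `workbook_to_json("scores.xls", "scores.json")` (hypothetical file names) would write one JSON object per spreadsheet row. Note that xlrd returns Excel numbers as floats, so integer-looking cells appear as values such as `3.0` unless they are converted first.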
Drinking, sex, eating: Why don't we tell the truth in surveys? 27 February 2013. Last updated at 13:56 GMT. By Brian Wheeler, BBC News Magazine. Many people are under-reporting how much alcohol they are drinking. But what else are we fibbing to researchers about and why do we do it? "I have the occasional sweet sherry. It is a classic British sitcom scene. But the tendency to paint a less-than-honest picture about your unhealthy habits and lifestyle is not just restricted to alcohol. It is understandable that people want to present a positive image of themselves to friends, family and colleagues. After all, the man or woman from the Office for National Statistics or Ipsos Mori can't order you to go on a diet or lay off the wine. It is a question that has been puzzling social scientists for decades. They even have a name for it - The Social Desirability Bias. "People respond to surveys in the way they think they ought to. The recycling never lies It is a particular problem when it comes to "sins" such as alcohol and food.
Toward sustainable insights, or why polygamy is bad for you | the morning paper Binning et al., CIDR 2017 Buckle up! Today we’re going to be talking about statistics, p-values, and the multiple comparisons problem. For my own benefit, I’ll try and explain what follows as simply as possible – I find it incredibly easy to make mistakes otherwise! p-values If we observe some variable and see value , we might wonder “what are the odds of that!” we’d be able to give an answer. about the underlying distribution. will be given that hypothesis, or : . Time to move on from dice rolls. we observe is now a measure of correlation between two measured phenomena. exactly equal to some value we need to ask ‘what are the odds of seeing a value (or )?’ . Suppose we see a suspiciously large value. p-value = (source: wikipedia) Here’s the first thinking trap. An arbitrary but universally accepted p-value of 0.05 (there’s a 5% chance of this observation given the hypothesis) is deemed as the threshold for ‘statistical significance.’
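To make the multiple comparisons problem concrete, here is a small sketch of how quickly the chance of at least one false positive grows when many hypotheses are each tested at the 0.05 threshold. It assumes the tests are independent, which is a simplification, and it shows the Bonferroni adjustment only as one familiar correction; it is not presented as the method proposed by Binning et al. The function names are illustrative.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are
    independent (a simplifying assumption).
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        rate = family_wise_error_rate(0.05, m)
        print(f"{m:>3} tests: P(at least one false positive) = {rate:.3f}")
```

With 20 independent tests at 0.05 the chance of at least one spurious "significant" result is roughly 64%, which is why exploring many comparisons without any correction is risky.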
Tableau Tip: Embedding dashboards from multiple, disparate workbooks into a single workbook Here’s the situation: You have several people creating their own dashboards in separate workbooks. Your boss doesn’t want to open all of the dashboards separately. You need all of these diverse dashboards in a single dashboard. The solution is way easier than you think. It’s simply a matter of using web page objects. For this example, I’m going to use these three views: Here’s how it’s done: Step 1 – You have to connect to some kind of data. Data as simple as this will work: Step 2 – Create a new dashboard. Step 3 – Add a web page object to the dashboard. You should now see your dashboard inside the dashboard. Notice also that if you have tabs enabled on the dashboard, you can see all of those tabs in the web page object. Step 4 – Repeat steps 2 & 3 for each of the views in the other workbooks. That’s it!
What is a large enough random sample? With the well-deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that: The required sample size is essentially independent of the total population size. The required sample size depends strongly on the strength of the effect you are trying to measure. These things are only hard to explain because the literature is overly technical (too many buzzwords and too many irrelevant concerns) and these misapprehensions can’t be relieved unless you spend some time addressing the legitimate underlying concerns they are standing in for. As usual, explanation requires common ground (moving to shared assumptions), not mere technical bullying. We will try to work through these assumptions and then discuss proper sample size. The problem of population size. The problem of effect strength.
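Both points can be checked numerically with a standard textbook calculation. The sketch below is not the article's own derivation: it uses the usual sample size formula for estimating a proportion to within a given margin of error, with the finite population correction, and treats that margin as a stand-in for the strength of the effect being measured. The function name and default values are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(margin, population=None, confidence=0.95, p=0.5):
    """Sample size to estimate a proportion to within +/- margin.

    p=0.5 is the worst case (largest variance). If population is
    given, the finite population correction is applied.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)


if __name__ == "__main__":
    # Population size barely matters once it is large...
    for population in (10_000, 1_000_000, 1_000_000_000):
        print(population, required_sample_size(0.03, population))
    # ...but the precision you ask for matters a great deal.
    for margin in (0.06, 0.03, 0.015):
        print(margin, required_sample_size(margin))
```

Under these assumptions, going from a population of a million to a billion changes the required sample by only a couple of respondents, while halving the margin of error roughly quadruples it.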
Majority to minority: the declining U.S. white population (Quand la majorité devient minorité : le cas des blancs aux Etats-Unis) | N-IUSSP In this essay we document the demography of the decline of the white population in the United States, a country with a long history of white supremacy. Despite the fact that the U.S. Constitution and the civil rights legislation of the 1960s guaranteed equality to all people irrespective of race or ethnicity, everyone is far from equal in the United States today. On average, whites are far better off economically and educationally and in many other ways than are minority peoples. Levels of residential segregation by race and ethnicity are still nearly as high as they were decades ago. Donald Trump won the U.S. presidential election by focusing his attention on white people. Let us first note that U.S. federal government agencies use two questions to measure race/ethnicity. First immigrants to the U.S. Whites were not the first people to settle in what is now the United States. The first sizeable stream of immigrants to the U.S. were whites from England. Whites and minorities in the U.S.
The Pith of Performance: Extracting the Epidemic Model: Going Beyond Florence Nightingale Part II This is the second of a two part reexamination of Florence Nightingale's data visualization based on her innovative cam diagrams (my term) shown in Figure 1. Figure 1. Nightingale's original cam diagrams Recap In Part I, I showed that FN applied sectoral areas, rather than a pie chart or conventional histogram, to reduce the visual impact of highly variable zymotic disease data from the Crimean War. Figure 2. Although R has some very sophisticated tools for producing almost any kind of data visualization (some of which I used in Part I), a very impressive interactive version of Fig. 1 has been created by the Statistical Laboratory at Cambridge University using Adobe Flash (may need browser plug-in). This kind of visualization requires a lot of work to construct and is perfect as an educational tool but complete overkill for typical data exploration. Back to Bar Charts Using R to create Fig. 3, I've reproduced a static version of the interactive Flash bar chart. Figure 3.