 # Statistics

Quantile Regression - Econometrics Academy.

Monte Carlo Simulation Basics. A Monte Carlo method is a technique that uses random numbers and probability to solve problems. The term "Monte Carlo method" was coined by S. Ulam and Nicholas Metropolis in reference to games of chance, a popular attraction in Monte Carlo, Monaco (Hoffman, 1998; Metropolis and Ulam, 1949).

Computer simulation uses computer models to imitate real life or make predictions. This type of model is usually deterministic, meaning that you get the same results no matter how many times you recalculate.

Example 1: A Deterministic Model for Compound Interest.

Figure 1: A parametric deterministic model maps a set of input variables to a set of output variables.
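As a rough illustration of the difference between a deterministic model and a Monte Carlo evaluation of it, the Python sketch below uses a compound interest formula as the deterministic model and then feeds it uniformly distributed random interest rates. The formula, the function names, the principal, and the 3% to 7% rate range are all assumptions chosen for illustration; they are not taken from the excerpt.

```python
import random
import statistics


def compound_interest(principal, annual_rate, periods_per_year, years):
    """Deterministic model: the same inputs always give the same future value."""
    return principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)


def simulate_future_values(n_trials, seed=0):
    """Evaluate the deterministic model repeatedly with random inputs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(n_trials):
        # Hypothetical uncertainty: annual rate uniform between 3% and 7%.
        rate = rng.uniform(0.03, 0.07)
        results.append(compound_interest(1000.0, rate, 12, 10))
    return results


fixed = compound_interest(1000.0, 0.05, 12, 10)
values = simulate_future_values(10_000)
print(f"deterministic result at 5%: {fixed:.2f}")
print(f"simulated mean: {statistics.mean(values):.2f}")
print(f"simulated stdev: {statistics.stdev(values):.2f}")
```

The single deterministic call gives one number, while the simulation gives a distribution of outcomes whose spread reflects the uncertainty placed on the input.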

Monte Carlo simulation is a method for iteratively evaluating a deterministic model using sets of random numbers as inputs. In Example 2, we used simple uniform random numbers as the inputs to the model.

Uncertainty Propagation. If you have made it this far, congratulations!

Power Analysis. Overview. Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, we would be wise to alter or abandon the experiment.

The following four quantities have an intimate relationship:

1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 - P(Type II error) = probability of finding an effect that is there

Given any three, we can determine the fourth.
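The link between the four quantities can be sketched in Python for one simple case: a two-sided, two-sample comparison of means under a normal approximation. This is an assumption made for illustration, not the exact t-based calculation a dedicated power analysis package would perform, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(n_per_group, effect_size, sig_level=0.05):
    """Approximate power given sample size, effect size, and significance level.

    Normal approximation; the negligible contribution of the opposite
    rejection tail is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - sig_level / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(shift - z_crit)


def approx_sample_size(effect_size, power=0.80, sig_level=0.05):
    """Approximate per-group sample size given the other three quantities."""
    z_crit = _STD_NORMAL.inv_cdf(1 - sig_level / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


n = approx_sample_size(effect_size=0.5, power=0.80, sig_level=0.05)
print("approximate n per group:", n)
print("approximate power at that n:", round(approx_power(n, 0.5), 3))
```

Because it uses the normal rather than the t distribution, this sketch tends to give a slightly smaller sample size than an exact calculation would.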

Power Analysis in R. The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen. The significance level defaults to 0.05. Specifying an effect size can be a daunting task.

T-tests. For t-tests, use the following functions: ...

How To Determine Sample Size, Determining Sample Size. In order to prove that a process has been improved, you must measure the process capability before and after improvements are implemented.

This allows you to quantify the process improvement (e.g., defect reduction or productivity increase) and translate the effects into an estimated financial result, something business leaders can understand and appreciate. If data is not readily available for the process, how many members of the population should be selected to ensure that the population is properly represented? If data has been collected, how do you determine if you have enough data?

Determining sample size is a very important issue because samples that are too large may waste time, resources and money, while samples that are too small may lead to inaccurate results. In many cases, we can easily determine the minimum sample size needed to estimate a process parameter, such as the population mean $\mu$. When sample data is collected and the sample mean $\bar{x}$ is calculated, it generally differs from $\mu$ by some error. To estimate $\mu$ to within a margin of error $E$, the standard formula is

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

where $n$ is the sample size, $z_{\alpha/2}$ is the critical value for the chosen confidence level, and $\sigma$ is the population standard deviation.

R Tutorials--Logistic Regression.

Preliminaries. Model Formulae. You will need to know a bit about Model Formulae to understand this tutorial.

Odds, Odds Ratios, and Logit. When you go to the track, how do you know which horse to bet on? You look at the odds.

$$\text{odds} = \frac{p(\text{one outcome})}{p(\text{the other outcome})} = \frac{p(\text{success})}{p(\text{failure})} = \frac{p}{q}, \quad \text{where } q = 1 - p$$

So for Sea Brisket, odds(winning) = (1/9)/(8/9) = 1/8. The natural log of odds is called the logit, or logit transformation, of p: $\text{logit}(p) = \log_e(p/q)$. If odds(success) = 1, then logit(p) = 0.

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions or probabilities, or binary coded (0,1--failure,success) data.

$$y = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$$

Logistic regression fits $b_0$ and $b_1$, the regression coefficients (which were 0 and 1, respectively, for the graph above). Taking the logit of both sides gives the linear form:

$$\text{logit}(y) = b_0 + b_1 x$$

Odds ratio might best be illustrated by returning to our horse race.
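The odds, logit, and logistic curve defined above can be checked numerically with a short Python sketch. The function names are hypothetical, and the default coefficients b0 = 0 and b1 = 1 simply mirror the values quoted in the text.

```python
import math


def odds(p):
    """Odds of an outcome with probability p: p / q, where q = 1 - p."""
    return p / (1 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def logistic(x, b0=0.0, b1=1.0):
    """Logistic curve y = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))


print("odds for p = 1/9:", odds(1 / 9))
print("logit when odds are 1 (p = 0.5):", logit(0.5))
print("logit of the logistic curve at x = 2:", logit(logistic(2.0)))
```

The last line illustrates that the logit undoes the logistic transformation, which is why the model is linear on the logit scale.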

Numerous explanations are in order!

## Networks

The p value and the base rate fallacy.

You’ve already seen that p values are hard to interpret. Getting a statistically insignificant result doesn’t mean there’s no difference. What about getting a significant result? Let’s try an example. Suppose I am testing a hundred potential cancer medications. Only ten of these drugs actually work, but I don’t know which; I must perform experiments to find them. In these experiments, I’ll look for p<0.05 gains over a placebo, demonstrating that the drug has a significant benefit. To illustrate, each square in this grid represents one drug.
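The arithmetic behind this kind of screening experiment can be sketched in Python. The statistical power of 0.8 is an assumption introduced here for illustration (the excerpt does not state it), and the function name is hypothetical.

```python
def screening_outcome(n_drugs, n_effective, power, alpha):
    """Expected counts when testing many drugs at significance level alpha.

    Returns (true positives, false positives, fraction of "significant"
    results that are truly effective).
    """
    true_positives = n_effective * power
    false_positives = (n_drugs - n_effective) * alpha
    fraction_real = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, fraction_real


# 100 drugs, 10 of which work; power 0.8 is an assumed value.
tp, fp, fraction_real = screening_outcome(100, 10, power=0.8, alpha=0.05)
print("expected true positives:", tp)
print("expected false positives:", fp)
print("fraction of significant results that are real:", round(fraction_real, 2))
```

The expected number of false positives is fractional, so rounding it to a whole number of drugs shifts the final fraction slightly. Either way, it stays far below the 95% that a naive reading of p<0.05 might suggest.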

As we saw, most trials can’t perfectly detect every good medication. Of the ninety ineffectual drugs, I will conclude that about 5 have significant effects (5% of ninety, at the p<0.05 threshold). So I perform my experiments and conclude there are 13 working drugs: 8 good drugs and 5 I’ve included erroneously, shown in red. The chance of any given “working” drug being truly effectual is only 8/13, or about 62%. You often hear people quoting p values as a sign that error is unlikely. Whoops. What happened?

R Tutorials--Chi Square Test of Independence.

Syntax. From the help page, the syntax of the chisq.test( ) function is:

`chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)`

This function is used for both the goodness of fit test and the test of independence, and which test it does depends upon what kind of data you feed it.

If "x" is a numerical vector or a one-dimensional table of numerical values, a goodness of fit test will be done (or attempted), treating "x" as a vector of observed frequencies. Ignore "y".

Textbook Problems. The following textbook-like problem uses data from Hand et al. (1994)...

Senie et al. (1981) investigated the relationship between age and frequency of breast self-examination in a sample of women (Senie, R. ...). The data have already been tabled for us in most textbook problems. If the data are available in an electronic document, like this one, they can be entered into R using the scan( ) function...

Data From a Table Object.

How to calculate pseudo-$R^2$ from R's logistic regression.