Introduction to Inference
Goal: Create a mathematical model for some random variable, or for an association between two or more random variables.
This may involve:
Creating a new model
Evaluating an existing model
Model for a random variable: A distribution with some parameters.
Model for an association between random variables: It’s complicated.
The random variable(s) is (are) tied to some “population”:
We then talk about the population distribution, the population parameter, and a model of the population.
Population: All potential patients in the world who currently have a certain disease, had the disease in the past, or will have the disease in the future.
Population: All plants of a certain species in a given forest.
By looking at a sample, figure out exactly what model to use for the population variable.
That means complete description of the distribution, including the exact values of all the parameters.
We only have a sample of the values, and that is not going to be enough.
Samples vary!!!
Figure out something about one of the parameters, or
Figure out something about the way the variable is distributed.
Perhaps we already have some idea about the type of the distribution, can we learn something about its parameter(s)?
Simplest case: The variable can be modeled by a Bernoulli distribution with some (unknown) probability of success \(p\).
A certain (unknown) proportion \(p\) of residents of a city are infected with some virus.
A certain (unknown) proportion \(p\) of voters will vote for a specific candidate.
A certain (unknown) proportion \(p\) of gadgets made in a factory are faulty.
In a sample of \(n\) independent values, the number of successes \(x\) will be a \(\operatorname{Binom}(n, p)\) random variable.
We say that \(x\) is a sample statistic, and its sampling distribution is \(\operatorname{Binom}(n, p)\).
Another, better, sample statistic is the sample proportion of successes: \[\widehat{p} = \frac{x}{n}\]
\(\widehat p\) is a point estimate for \(p\).
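To make the point estimate concrete, here is a minimal Python sketch that computes \(\widehat{p} = x/n\) from a sample of 0/1 outcomes. The data are simulated, and the value `true_p = 0.3`, the sample size of 50, and the helper name `sample_proportion` are illustrative assumptions, not taken from any real study.

```python
import random

def sample_proportion(values):
    """Return p-hat = x / n for a sample of 0/1 outcomes."""
    n = len(values)
    if n == 0:
        raise ValueError("sample must be non-empty")
    x = sum(values)  # number of successes in the sample
    return x / n

# Hypothetical setup: simulate n = 50 independent Bernoulli(true_p) values.
rng = random.Random(1)          # fixed seed so the run is reproducible
true_p = 0.3                    # assumed population proportion (unknown in practice)
sample = [1 if rng.random() < true_p else 0 for _ in range(50)]

p_hat = sample_proportion(sample)
print(f"x = {sum(sample)}, n = {len(sample)}, p_hat = {p_hat:.2f}")
```

Rerunning with a different seed gives a different \(\widehat{p}\), which is exactly the sample-to-sample variation the sampling distribution describes.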
Sampling distribution of \(x\) is \(\operatorname{Binom}(n, p)\).
The mean of \(x\) is \(np\), the variance of \(x\) is \(np(1-p)\), and the standard error of \(x\) is \(\sqrt{np(1-p)}\).
If \(np\) and \(n(1-p)\) are large enough, we can approximate the sampling distribution of \(x\) by \(\displaystyle\operatorname{N}\left(np, \sqrt{np(1-p)}\right)\).
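As a rough illustration of how close the normal approximation can be, the sketch below compares exact binomial probabilities \(P(x \le k)\) with the values given by \(\operatorname{N}\left(np, \sqrt{np(1-p)}\right)\). The choices \(n = 200\), \(p = 0.4\) and the helper names are assumptions made for the example, and no continuity correction is applied.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binom(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) using N(np, sqrt(np(1-p))), without continuity correction."""
    return NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(k)

# Illustrative values: np = 80 and n(1-p) = 120 are both large.
n, p = 200, 0.4
for k in (70, 80, 90):
    exact = binom_cdf(k, n, p)
    approx = normal_approx_cdf(k, n, p)
    print(f"P(X <= {k}): exact = {exact:.4f}, normal approx = {approx:.4f}")
```

Trying a small \(n\) or a \(p\) close to 0 or 1 in the same script shows the two columns drifting apart.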
What are the mean, variance, and standard error of \(\widehat{p}\)?
The sampling distribution of \(\widehat{p}\) can be approximated by \(\displaystyle\operatorname{N}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\).
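One way to see where these values come from is to divide the results for \(x\) by \(n\). This sketch uses only the mean and variance of \(x\) stated earlier, together with the standard rules \(\operatorname{E}(aX) = a\operatorname{E}(X)\) and \(\operatorname{Var}(aX) = a^2\operatorname{Var}(X)\):
\[\operatorname{E}(\widehat{p}) = \frac{\operatorname{E}(x)}{n} = \frac{np}{n} = p\]
\[\operatorname{Var}(\widehat{p}) = \frac{\operatorname{Var}(x)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}\]
\[\operatorname{SE}(\widehat{p}) = \sqrt{\operatorname{Var}(\widehat{p})} = \sqrt{\frac{p(1-p)}{n}}\]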
The sample consists of \(n\) independent values of the same Bernoulli variable with probability of success \(p\).
The success-failure condition: Both \(np\) and \(n(1-p)\) are sufficiently large.
Then the sample proportions \(\widehat{p}\) are approximately normally distributed with mean \(p\) and standard error \[\sqrt{\frac{p(1-p)}{n}}\]
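The conditions above can be checked mechanically. The sketch below is one possible way to do it. The text only says "sufficiently large", so the cutoff of 10 is an assumption (a commonly used rule of thumb), and the function names are hypothetical.

```python
from math import sqrt

def success_failure_ok(n, p, threshold=10):
    """Check that both np and n(1-p) reach the threshold.

    threshold=10 is an assumed rule of thumb; the text only says
    'sufficiently large'.
    """
    return n * p >= threshold and n * (1 - p) >= threshold

def standard_error(n, p):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

# Example inputs chosen for illustration only.
for n, p in [(100, 0.7), (20, 0.1), (1000, 0.02)]:
    ok = success_failure_ok(n, p)
    print(f"n = {n}, p = {p}: condition met = {ok}, SE = {standard_error(n, p):.4f}")
```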
Population: \(p = 0.7\)
Sample size: \(n = 100\)
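Sampling distribution for \(\widehat{p}\): since \(np = 70\) and \(n(1-p) = 30\) are both large, approximately \(\displaystyle\operatorname{N}\left(0.7, \sqrt{\frac{0.7 \cdot 0.3}{100}}\right) \approx \operatorname{N}(0.7, 0.046)\).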
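As a hedged check of this sampling distribution, the sketch below draws many samples of size \(n = 100\) from a Bernoulli population with \(p = 0.7\) and compares the simulated mean and spread of \(\widehat{p}\) with the theoretical values \(p\) and \(\sqrt{p(1-p)/n}\). The number of repetitions (5000) and the seed are arbitrary choices for the example.

```python
import random
from math import sqrt
from statistics import mean, stdev

p, n = 0.7, 100                        # population proportion and sample size from the example
theory_se = sqrt(p * (1 - p) / n)      # theoretical standard error of p-hat

rng = random.Random(0)                 # fixed seed for reproducibility

def draw_p_hat():
    """Draw one sample of size n and return its sample proportion."""
    x = sum(1 for _ in range(n) if rng.random() < p)
    return x / n

# Assumed number of repetitions; larger values give a smoother picture.
p_hats = [draw_p_hat() for _ in range(5000)]

print(f"theoretical mean = {p:.3f}, simulated mean = {mean(p_hats):.3f}")
print(f"theoretical SE   = {theory_se:.4f}, simulated SD = {stdev(p_hats):.4f}")
```

A histogram of `p_hats` should look roughly bell-shaped and centered near 0.7.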
A sample of size \(n = 100\) is drawn from a population in which \(p = 0.7\).