Introduction to the theory of statistics
Imagine you work at a hospital. Dozens of babies are delivered in the hospital each week. A lot of information is collected about each birth, and hiding inside of these data is a lot of useful information about the health and future of the community. How can we extract it? If you take classes like STA 101 or 199, you learn various methods:
| Question | Method |
|---|---|
| How are birth weights distributed? | histogram |
| What’s the typical birth weight? | sample average |
| How is birth weight related to gestational age? | linear regression |
The goal of data science in general is to convert data into knowledge, and we perform this “conversion” by applying statistical methods: histogram, sample average, line of best fit, etc.
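As a rough sketch, the three methods in the table might look like this in Python with NumPy. The data below are synthetic and invented purely for illustration (they are not real hospital records), and the variable names and the numbers used to generate them are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: gestational age in weeks, birth weight in kg.
n = 200
gestational_age = rng.normal(39.0, 1.5, size=n)
birth_weight = 3.4 + 0.15 * (gestational_age - 39.0) + rng.normal(0.0, 0.4, size=n)

# How are birth weights distributed? -> histogram (bin counts and bin edges)
counts, edges = np.histogram(birth_weight, bins=10)

# What's the typical birth weight? -> sample average
avg_weight = birth_weight.mean()

# How is birth weight related to gestational age? -> least-squares line
# (np.polyfit returns coefficients from highest degree to lowest)
slope, intercept = np.polyfit(gestational_age, birth_weight, deg=1)

print(counts, avg_weight, slope, intercept)
```

Each method takes in the raw numbers and returns a summary. Whether that summary deserves our trust is the question the rest of these notes takes up.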
A mathematical statistician looks at these methods and wonders: do they actually work? Do they behave the way we expect? Do they deliver what they promise? Are they reliable? How reliable? Under what conditions? And before we can answer these questions, we have to back up and ask a more fundamental one: what does it even mean for a statistical method to “work”? These are all theoretical questions, and mathematical statisticians answer them using the tools of probability theory that we have studied for the last twelve weeks.
Baby’s first dataset
For us, a data set will be a spreadsheet with one column of numbers:
XXX
Of course, in the modern era, datasets are huge. There could be millions of columns. There could be so many columns and rows that you can’t fit the entire dataset on a single computer. Welcome to “big data.” Furthermore, modern datasets are weird. It’s not just a box of numbers anymore. Text is data. Images are data. Video is data. Sometimes all at once.
If you continue studying statistics, you’ll get there eventually. But for now, let’s keep it simple.
The implicit assumption of all statistics
Observed data are the result of a random process.
By the time the data arrive in our spreadsheet, they are a fixed set of numbers. But how did those numbers get there in the first place? Plenty of seemingly random forces might have had their way before you ever got to observe anything:
- (nature’s randomness) many of the phenomena we study have an intrinsic random component that is simply irreducible. Think mutations during DNA replication, the quantum behavior of subatomic particles, or the “random walk” in stock prices;
- (human error) data are collected by humans, and humans make mistakes;
- (measurement error) the laboratory devices we use to collect measurements in the sciences are far from perfect;
- (study design) the gold standard for teasing out cause and effect is the randomized controlled trial (RCT) where, by design, the researcher randomly divides subjects into treatment and control groups;
- (survey non-response) much of our economic data on unemployment, inflation, and the behavior of firms is collected by survey. Political polls are a type of survey. Course evaluations are a survey. Say you issue a survey to a population of interest, and only 10% of them respond. Who are they? Why did they respond? Why did the other 90% abstain? There’s a lot of randomness going on here, and you need to get your arms around it if you want to interpret the survey results correctly.
For all of these reasons and more, it is sensible to regard the numbers in our spreadsheet as the end result of a complex random process. One of the statistician’s goals is to model this process and explain how the data turned out the way they did. Part of this job involves modeling the true underlying science, and another part involves modeling the errors introduced in the measurement process. Tricky stuff!
Our mathematical model of data analysis
Since the data in our spreadsheet are random, we model them on the blackboard as realizations of random variables. Y’know, those things we’ve been studying for two months. To keep it simple, on a first pass we model the data as independent and identically distributed (iid) from some shared distribution \(P_0\):
\[ X_1,\,X_2,\,X_3,\,\ldots,\,X_{n-1},\,X_n\overset{\text{iid}}{\sim}P_0. \]
Again, if you keep studying statistics, you learn how to relax the iid assumption, which is often bogus.
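To make the iid model concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that \(P_0\) is a normal distribution with mean 3.4 and standard deviation 0.5. In practice \(P_0\) is unknown, which is the whole point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption for illustration only: P_0 = Normal(mean=3.4, sd=0.5).
n = 10

# One realization of X_1, ..., X_n: this is the column in our spreadsheet.
x = rng.normal(loc=3.4, scale=0.5, size=n)

# Running the same random process again yields a different column of numbers.
x_again = rng.normal(loc=3.4, scale=0.5, size=n)

print(x)
print(x_again)
```

The spreadsheet we actually observe corresponds to just one such draw. The theory asks how our methods behave across all the draws we could have seen.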
Classical statistical inference
Parametric statistics
Statisticians have found that it is convenient to use the tools and language of probability to get a handle on the variation that we observe in real-world data.
This assumption is so foundational to statistics that it often goes unremarked upon and recedes into the background, but it is indeed an assumption. Not every set of numbers you would seek to extract patterns from is random, but this is the first assumption a statistician makes.
By the time data arrive in our spreadsheet, they are a fixed set of numbers. But how did they get there?
We use the language and tools of probability to model variation. Even if the variation in our datasets is not literally the result of random forces, it may be convenient to proceed as if it is.
In this class, our data will be numerical.
Yada yada data science yada yada extract patterns and learn about the world.
- goal: typical behavior - mean
- goal: distribution - plot histogram
- goal: learn relationships - line of best fit
These are all examples of statistical methods. A theoretical statistician
Do these methods actually work? Do they do what we want them to do? Do they deliver what they promise to deliver? Under what conditions? And before we can answer those questions, we have to back up and ask a more fundamental one: what does it even mean for a statistical method to “work”? These are all theoretical questions, and mathematical statisticians use the tools of probability theory to answer them.
In American government, there are legislative, judicial, and executive branches. In music, there is melody, harmony, and rhythm. Similarly, we organize statistical theory into three related areas of inquiry:
- (point estimation)
- (interval estimation)
- (hypothesis testing) can the data distinguish between
Point estimation
Measuring the quality of an estimator
Loss functions
\[ E[L(\hat{\theta}_n,\,\theta_0)] \]
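The expected loss above can rarely be read off a single dataset, but it can be approximated by simulation when we are willing to pretend we know \(P_0\). The sketch below assumes \(P_0\) is Normal(\(\theta_0\), 1) and uses the sample mean as the estimator. The function names `monte_carlo_risk`, `squared_error`, and `absolute_error` are hypothetical helpers written for this example.

```python
import numpy as np

def monte_carlo_risk(estimator, loss, theta0, n, reps=20_000, seed=0):
    """Approximate E[L(theta_hat_n, theta0)] by averaging the loss over simulated datasets."""
    rng = np.random.default_rng(seed)
    # Assumption: P_0 = Normal(theta0, 1). Each row is one simulated dataset of size n.
    samples = rng.normal(loc=theta0, scale=1.0, size=(reps, n))
    # Apply the estimator to every dataset, giving one estimate per row.
    estimates = np.apply_along_axis(estimator, 1, samples)
    return float(np.mean(loss(estimates, theta0)))

def squared_error(estimate, theta):
    return (estimate - theta) ** 2

def absolute_error(estimate, theta):
    return np.abs(estimate - theta)

risk_sq = monte_carlo_risk(np.mean, squared_error, theta0=2.0, n=25)
risk_abs = monte_carlo_risk(np.mean, absolute_error, theta0=2.0, n=25)
print(risk_sq, risk_abs)
```

Under these assumptions the squared-error risk of the sample mean is exactly \(1/n = 0.04\), so `risk_sq` should land close to that value. Swapping in a different `loss` changes what "close to the truth" means without changing the recipe.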
Mean squared error
\[ \begin{aligned} \text{MSE}(\hat{\theta}_n,\,\theta) &= E[(\theta-\hat{\theta}_n)^2] \\ &= E[(\underbrace{\theta-E(\hat{\theta}_n)}_{\text{keep together}}+\underbrace{E(\hat{\theta}_n)-\hat{\theta}_n}_{\text{keep together}})^2] && \text{add zero} \\ &= E\left[\underbrace{(\theta-E(\hat{\theta}_n))^2}_{\text{constant}}+\underbrace{2(\theta-E(\hat{\theta}_n))}_{\text{constant}}\underbrace{(E(\hat{\theta}_n)-\hat{\theta}_n)}_{\text{random}}+\underbrace{(E(\hat{\theta}_n)-\hat{\theta}_n)^2}_{\text{random}}\right] && \text{FOIL} \\ &= (\theta-E(\hat{\theta}_n))^2+ 2(\theta-E(\hat{\theta}_n)) E\left[E(\hat{\theta}_n)-\hat{\theta}_n\right] + E[(E(\hat{\theta}_n)-\hat{\theta}_n)^2] && \text{linearity} \\ &= (\theta-E(\hat{\theta}_n))^2+ 2(\theta-E(\hat{\theta}_n)) \left[E(\hat{\theta}_n)-E(\hat{\theta}_n)\right] + E[(E(\hat{\theta}_n)-\hat{\theta}_n)^2] && \text{linearity again} \\ &= (\theta-E(\hat{\theta}_n))^2+ 2(\theta-E(\hat{\theta}_n)) \cdot 0 + E[(E(\hat{\theta}_n)-\hat{\theta}_n)^2] \\ &= (\theta-E(\hat{\theta}_n))^2 + E[(E(\hat{\theta}_n)-\hat{\theta}_n)^2] \\ &= \text{bias}(\hat{\theta}_n,\,\theta)^2 + \text{var}(\hat{\theta}_n). \end{aligned} \]
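The decomposition can be checked numerically. The sketch below assumes \(P_0\) is Normal(0, 4) and takes \(\theta\) to be its variance, estimated by the divide-by-\(n\) sample variance, which is a biased estimator. These choices are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 4.0            # true theta: the variance of the assumed P_0 = Normal(0, 4)
n, reps = 10, 50_000    # sample size and number of simulated datasets

# Each row is one simulated dataset; each row produces one estimate.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
theta_hat = samples.var(axis=1)   # ddof=0, i.e. the divide-by-n estimator

# Monte Carlo versions of the three quantities in the decomposition.
mse = np.mean((sigma2 - theta_hat) ** 2)
bias = theta_hat.mean() - sigma2
variance = theta_hat.var()        # ddof=0, matching the expectation in the derivation

print(mse, bias**2 + variance)
```

The two printed numbers agree up to floating-point error, because the algebra in the derivation holds just as well when every expectation is replaced by an average over the simulated datasets. For this estimator the theoretical bias is \(-\sigma^2/n = -0.4\), so `bias` should come out near that value.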
Classic bullseye analogy