Chapter 2 What is statistics?
The word “statistics” is confusingly used in numerous ways. Statistics can refer to the field of mathematics or the analyses used by that field. A statistic is a quantitative summary of data. In other words, a statistic is a number that describes something, usually multiple observations of something. A mean is an example of a statistic.
Statistical analyses are helpful because they allow us to understand the world on a deeper level. We cannot reliably spot trends and relationships just by looking at a spreadsheet of numbers, but the process of statistical analysis can do just that.
A helpful way of understanding statistical analysis is to look at it as model-building. Imagine a model train. The model train is different from the real train it was designed after: it’s smaller, made of different materials, less detailed. But they also share a number of similarities: shape, colors, proportions. The point of the model train is not for it to perfectly replicate the real train. Rather, it serves as a simplified version where specific details are retained and others are tossed away.
Statistical analysis is mathematical modeling. We take the real world and assign numbers to certain things. In doing so, we are creating a vast simplification and ignoring many other details. However, that simplification allows us to do a number of things. It’s a constant trade-off between accuracy and simplicity. This is captured well in the following quote:
“All models are wrong, but some are useful.”
—George Box
The point of statistics is not perfection. It’s utility.
2.1 Statistics in Psychology
Psychology involves, the vast majority of the time, the study of people. The inherent problem is that such study is messy and prone to error. This is why the field of psychology has been criticized since its inception. However, statistics is like the housekeeper who cleans up the mess before guests arrive. With it, psychology becomes more rigorous and scientific.
One way we can think about the mess involved in the study of psychology is the uncertainty it creates. For example, when individuals fill out surveys, we don’t know with 100% certainty that they are being truthful. Unfortunately, if individuals lie in our studies, that takes away from our ability to draw meaningful conclusions from our results. Statistics uses mathematics and probability to quantify that uncertainty (see Hypothesis Testing).
2.2 Populations and Samples
Information can generally come from two different places: either a population (such as all undergraduate students) or a sample (a selection of 100 undergraduate students). In many cases, populations are too vast to gather information from each person. It would take a heck of a lot of time, money, and resources to gather information from all undergraduate students in the world.
What we must often do is settle for a subset of those individuals that we are interested in. So, we choose just 100 undergraduate students. And what we might find is that they slept, on average, just two hours the previous night. That’s wild! We must have a serious problem on our hands!
But wait! That was only 100 students and it just so happens that they all had an exam that morning. So, they stayed up pretty late studying. Can we generalize to all undergraduates? NO! In reality, undergraduates average six hours of sleep per night. Not ideal, but hey, at least it’s something.
Remember how a statistic (like the mean) is used to describe things? Well, that’s partially true. A statistic can be used to describe a sample. When describing a population, the terminology changes. Instead, a parameter can be used to describe a population. Use the alliteration to help you remember (sample = statistic, population = parameter).
The mean can actually be a population parameter as well as a sample statistic. When studying our sleepy undergraduates, their average of two hours of sleep per night was a sample statistic. The reality, that undergraduates average six hours of sleep per night, was a population parameter (though remember that we rarely know true parameters). They were pretty different!
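The gap between a sample statistic and a population parameter can be sketched with a small simulation. Everything here is made up for illustration: the population is a hypothetical list of 100,000 sleep durations centered on six hours, and the "exam morning" sample is approximated by simply taking the 100 shortest sleepers. The function name `simulate_sleep_study` and its arguments are assumptions, not anything defined in this book.

```python
import random
import statistics


def simulate_sleep_study(seed=0, population_size=100_000, sample_size=100):
    """Return (population mean, random-sample mean, biased-sample mean)."""
    rng = random.Random(seed)

    # Hypothetical population: hours slept last night, centered on 6.
    # Negative draws are clipped to 0 because nobody sleeps negative hours.
    population = [max(0.0, rng.gauss(6.0, 1.5)) for _ in range(population_size)]

    # Population parameter: the mean of everyone (rarely known in practice).
    parameter = statistics.mean(population)

    # Sample statistic: the mean of 100 students chosen at random.
    random_sample = rng.sample(population, sample_size)
    statistic = statistics.mean(random_sample)

    # Biased sample (assumption): the 100 shortest sleepers stand in for
    # students who all stayed up studying for an exam.
    biased_sample = sorted(population)[:sample_size]
    biased_statistic = statistics.mean(biased_sample)

    return parameter, statistic, biased_statistic


parameter, statistic, biased_statistic = simulate_sleep_study()
print(f"Population parameter:      {parameter:.2f} hours")
print(f"Random-sample statistic:   {statistic:.2f} hours")
print(f"Biased-sample statistic:   {biased_statistic:.2f} hours")
```

The random sample should land close to the population mean, while the biased sample lands far below it. That is the sleepy-undergraduate problem in miniature: the statistic only tells us about the population if the sample resembles it.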
When we are making generalizations about populations from samples, we always run the risk of making mistakes. Let that both be a reason why statistical analyses are useful and why we should be cautious with them. Statistical analysis can help us quantify and account for our errors, but it can also yield errors itself. Kinda meta, right? It can be tricky to talk about. That’s part of why we’ve had hundreds of years of debate on the subject1.
2.3 Probability
We use probability to quantify our uncertainty (uncertainty can come from errors or just variability). It’s also worth knowing that there have been decades of ongoing debate regarding what probability truly means2. Probabilities can be said to measure tendencies or long-term frequencies. Saying that something has \(.80\) probability implies that, over time, it will occur in \(8/10\) instances. Some translate it as an “\(80\)% chance.” Regardless, these debates will not likely influence the use of probabilities at beginner levels of statistical analysis.
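The long-term frequency reading of a probability can be illustrated with a short simulation. This is only a sketch: the function name `long_run_frequency`, the number of trials, and the seed are all arbitrary choices made for the example.

```python
import random


def long_run_frequency(p=0.80, trials=10_000, seed=1):
    """Simulate `trials` events that each occur with probability `p`
    and return the proportion of trials in which the event occurred."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it falls below p with probability p.
    hits = sum(rng.random() < p for _ in range(trials))
    return hits / trials


for n in (10, 100, 10_000):
    print(n, "trials:", long_run_frequency(p=0.80, trials=n))
```

With only 10 trials the proportion can stray noticeably from \(.80\); with many trials it tends to settle near \(.80\), which is what "over time" means in the frequency interpretation.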
Probability notation follows a simple format: \(P(X) = p\) where \(P\) indicates probability, \(X\) indicates what the probability refers to, and \(p\) is the value of the probability. For example, the probability that the sun will rise tomorrow can be written as \(P(sun\ rising) = .99\).
We also talk about conditional probabilities, which refer to probabilities given some condition. These are written similarly: \(P(X\ |\ Y) = p\). For example, the probability that you are in a statistics course given that you are reading this can be written as \(P(You\ are\ in\ statistics\ |\ You\ are\ reading\ this) = 1.00\). The given is the condition, and is written after the vertical bar \(|\) in the parentheses.
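One way to make a conditional probability concrete is to compute it from counts: among the cases where the condition \(Y\) holds, what fraction also show \(X\)? The sketch below assumes made-up numbers (50 students had an exam this morning, and 40 of those slept under four hours); the function name `conditional_probability` is hypothetical.

```python
def conditional_probability(count_x_and_y, count_y):
    """Estimate P(X | Y) as the share of Y cases in which X also occurred."""
    if count_y == 0:
        raise ValueError("P(X | Y) is undefined when Y never occurs")
    return count_x_and_y / count_y


# Hypothetical counts, invented for illustration only.
students_with_exam = 50          # Y: had an exam this morning
short_sleepers_with_exam = 40    # X and Y: slept under four hours AND had an exam

p = conditional_probability(short_sleepers_with_exam, students_with_exam)
print(f"P(short sleep | exam) = {p:.2f}")
```

Note that the denominator is the number of students who meet the condition, not the number of students overall. Restricting attention to the "given" is what the vertical bar expresses.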
2.4 Notation
To succeed in statistical analysis, you need to be familiar with the notation used. Here are some common symbols:
Symbol | Meaning | Example |
---|---|---|
\(\large \bar X\) | Mean | \(\large \bar X = \frac{X_1 + X_2 + X_3}{3}\) |
\(\large \sum X\) | Sum | \(\large \sum X = X_1 + X_2 + X_3\) |
\(\large \hat X\) | Estimate | \(\large \hat \sigma = s\) |
Subscripts are often used to identify specific things when there is more than one of that thing. You will find that \(H\) stands for hypothesis, and the subscripts on \(H_0\) and \(H_1\) tell us which hypothesis. Further, \(\bar X_a\) and \(\bar X_b\) are both means, but refer to different means.
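It can help to see the same symbols as plain arithmetic. The sketch below uses three made-up observations standing in for \(X_1\), \(X_2\), and \(X_3\), and Python's standard `statistics` module; the values and variable names are assumptions chosen only to keep the numbers easy to check by hand.

```python
import statistics

# Hypothetical observations X_1, X_2, X_3.
x = [4.0, 6.0, 8.0]

# Sum: add up every observation (4 + 6 + 8 = 18).
sum_x = sum(x)

# Mean: the sum divided by the number of observations (18 / 3 = 6).
mean_x = sum_x / len(x)

# Estimate: the sample standard deviation is commonly used as an
# estimate of the population standard deviation.
s = statistics.stdev(x)

print("Sum of X:", sum_x)
print("Mean of X:", mean_x)
print("Sample standard deviation:", s)
```

The subscripts in the notation correspond to positions in the list: \(X_1\) is `x[0]`, \(X_2\) is `x[1]`, and so on (Python counts from zero, the notation counts from one).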
2.5 Controversy
Statistical analysis is not without its faults. Over the years a number of statistics have actually led individuals farther from the truth.