What Is a Mean?

Categories: mean, standard deviation

Author: Josef Fruehwald

Published: September 25, 2024

Mean as a model

Random values setup
library(tidyverse)  # for tibble(), the dplyr verbs, and ggplot2 used throughout

set.seed(2024)
tibble(
  x = rnorm(10) * 10
) ->
  rand

Let’s say I gave you these values on a number line:

And I say to you

Guess

Guess the next value that’s going to appear in this data series. The person with the smallest absolute difference between their guess and the actual next value wins.

What’s your strategy?

All models are wrong

There are a bunch of principled and unprincipled strategies you could take. Two are:

  • The next number is probably going to be like one of the numbers that have already appeared, so I’ll pick one of them randomly.

  • The next number probably won’t be too far away from the mean of the numbers that have appeared so far, so I’ll calculate it and use that.

We can see how well these two strategies do for just one case:

tibble(
  rand_guess = sample(rand$x, 1),
  mean_guess = mean(rand$x),
  next_x = rnorm(1)*10,
  
  rand_diff = abs(rand_guess-next_x),
  mean_diff = abs(mean_guess-next_x)
) |>
  round(digits = 2)

The next value was nothing like either guess. But if we play this out over 5,000 simulations, it looks like guessing that the next value will be the mean does better than choosing a random value from the original data series.
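For illustration, here is a minimal sketch of how that comparison could be simulated 5,000 times. This is not necessarily the exact code behind the figures in these notes; it just repeats the one-shot comparison above.

guess_sims <- tibble(sim = 1:5000) |>
  mutate(
    # the two guessing strategies
    rand_guess = sample(rand$x, n(), replace = TRUE),
    mean_guess = mean(rand$x),
    # a new value drawn the same way as the original data
    next_x = rnorm(n()) * 10,
    rand_diff = abs(rand_guess - next_x),
    mean_diff = abs(mean_guess - next_x)
  )

guess_sims |>
  summarise(
    rand_strategy = mean(rand_diff),
    mean_strategy = mean(mean_diff)
  )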

The mean ≠ “typical” ≠ “possible”

There are many cases where the mean of a data set will never be typical of any individual data point, but it will still be a good value to guess for the next data point. For example, any extremely bimodal or binary data.

Binary data sample:

Guessing the next value 5,000 times, using either the mean or a randomly chosen 0 or 1:
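Here is a sketch of the binary version, assuming (hypothetically) that the 0/1 data are generated with rbinom() and a 70% probability of 1. Note that the mean of binary data is the proportion of 1s, a value no individual data point can ever equal.

binary_x <- rbinom(10, size = 1, prob = 0.7)

# the mean is the proportion of 1s
mean(binary_x)

tibble(sim = 1:5000) |>
  mutate(
    mean_guess = mean(binary_x),
    rand_guess = sample(binary_x, n(), replace = TRUE),
    next_x = rbinom(n(), size = 1, prob = 0.7),
    mean_diff = abs(mean_guess - next_x),
    rand_diff = abs(rand_guess - next_x)
  ) |>
  summarise(
    mean_strategy = mean(mean_diff),
    rand_strategy = mean(rand_diff)
  )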

This is why quantitative descriptions of the “average consumer” or “average voter” might wind up not characterizing many or any actual people.

Mean and Standard Deviation

name                 R function   population symbol   sample symbol
mean                 mean()       \(\mu\)             \(\bar{y}\)
standard deviation   sd()         \(\sigma\)          \(s\)
variance             var()        \(\sigma^2\)        \(s^2\)

Mean

A mathematical definition of the mean is:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]

The two pieces of the mathematical formula can get translated into R functions like this:

\(\sum_{i=1}^n x_i\) → sum(x)

\(\frac{1}{n}\) → 1/length(x)

Putting the two pieces together:

x_bar <- sum(x) / length(x)
rand |> 
  summarise(
    total = sum(x),
    n = n(),
    mean1 = total/n,
    mean2 = mean(x)
  ) |> 
  round(digits = 2)

“Location Parameter”

If you add a constant value to all numbers in a data series, the mean of the data series will also shift by the same amount.

mean(rand$x)
[1] 1.641674
mean(rand$x + 10)
[1] 11.64167
mean(rand$x - 20)
[1] -18.35833

Sometimes we’ll flip this reasoning and think about how adding a value to the mean can shift the location of the data.

For this reason, the mean is sometimes called a “location parameter.”

Standard Deviation

The standard deviation is a parameter describing how “spread out” a data set is. You could think of other ways to describe this. For example: how big the difference is between the largest and smallest value.

max(rand$x) - min(rand$x)
[1] 25.1723

But, that’s putting the pressure of describing the whole dataset on just two of its values. Ideally every data point would contribute in some way.

Building up to standard deviation

Let’s build up to the full mathematical definition of the standard deviation.

“Spread out-ed-ness”

First, we need some way to define, for each data point, how spread out it is with respect to the entire data series. The standard deviation uses each data point’s distance from the mean for this:

\[ x_i -\bar{x} \]

  • The mean describes some kind of central location in a data series.

  • So, each data point’s distance from the central point describes its “spread-out-ed-ness”.

Total spread-out-ed-ness

If we try to get either the total or average spread-out-ed-ness from just \(x_i-\bar{x}\), we’ll run into a problem:

sum(
  rand$x - mean(rand$x)
) |> 
  round(digits = 2)
[1] 0

Because the mean describes a central point in a data series, the distances of the data points below the mean exactly balance out the distances of the data points above it, so the differences sum to zero.

rand |> 
  mutate(
    diff = x - mean(x),
    sign = sign(diff)
  ) |> 
  summarise(
    .by = sign,
    total_diff = sum(diff)
  ) ->
  pos_neg_diff
sign   total_diff
   +         36.1
   −        −36.1

To deal with this, we’ll use a commonly recurring move you’ll see in stats:

When you need only positive values:

If you have some stats operation that returns both positive and negative values, but you need only positive values, square the results.

\[ \sum_{i=1}^n(x_i-\bar{x})^2 \]

sum(
  (rand$x - mean(rand$x))^2
) |> 
  round(digits = 2)
[1] 704.25

Average-ish spread-out-ed-ness

For reasons above our paygrade, instead of calculating the average spread-out-ed-ness by dividing by \(n\), we’ll divide by \(n-1\).

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2 \]

If we stop here, we actually have the definition for the sample variance.

x <- rand$x

x_bar <- mean(rand$x)

total_diff <- sum((x-x_bar)^2)

(x_var <- total_diff / (length(x)-1))
[1] 78.25034
var(x)
[1] 78.25034

Getting back to the original scale

To get back to describing the spread-out-ed-ness on a scale similar to the original data, we’ll take the square root (undoing the squaring we did before) to get the sample standard deviation.

\[ s = \sqrt{s^2} \]

sqrt(x_var)
[1] 8.845922
sd(rand$x)
[1] 8.845922

“Scale Parameter”

If we multiply or divide every data point by a constant value, the standard deviation winds up being scaled to the same degree.

sd(rand$x)
[1] 8.845922
sd(rand$x * 10)
[1] 88.45922
sd(rand$x / 10)
[1] 0.8845922

Just like with the mean, we’ll sometimes flip this reasoning around and think about multiplying or dividing the standard deviation as adjusting the scale of the data.

For this reason, the standard deviation is sometimes called a “scale parameter”.
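To see the “flipped” reasoning for both the location and scale parameters at once, here is a small sketch (the target values 100 and 15 are arbitrary): subtracting the mean and dividing by the standard deviation moves the data to location 0 and scale 1, and then adding and multiplying by new values moves it to whatever location and scale we like.

# standardize: location 0, scale 1
z <- (rand$x - mean(rand$x)) / sd(rand$x)

# shift and scale to a new location and scale
new_x <- 100 + (z * 15)

mean(new_x)
sd(new_x)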

The Normal Distribution, Again

We can use the mean and standard deviation as just summary statistics that can be calculated for any data set. Depending on the data set, they may or may not be all that good for characterizing the typical range of values.

But, the special cases \(\mu\) and \(\sigma\) can be used to also mathematically define the normal distribution’s probability density function.

Plotting code
ggplot()+
  xlim(-3,3)+
  stat_function(
    fun = dnorm,
    geom = "area"
  )+
  annotate(
    x = -2,
    y = 0.2,
    label = expression(list(mu==0,sigma==1)),
    parse = T,
    geom = "text"
  )+
  scale_y_continuous(
    expand = expansion(c(0, 0.1))
  )+
  theme_no_y()

The full mathematical definition of the normal distributions’ probability density function is:

\[ f(x) = \frac{1} { \sqrt{2\pi\sigma^2} }e^{ -\frac{(x-\mu)^2}{2\sigma^2} } \]
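As a quick check that this formula is doing what we think, here is a sketch that codes the density by hand and compares it to R’s built-in dnorm() (the input value 1.5 is arbitrary):

# the normal probability density function, written out by hand
my_dnorm <- function(x, mu = 0, sigma = 1) {
  (1 / sqrt(2 * pi * sigma^2)) * exp(-((x - mu)^2) / (2 * sigma^2))
}

my_dnorm(1.5)
dnorm(1.5, mean = 0, sd = 1)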

If you really want to know why this is the formula, I’d recommend this video:

Sampling and Uncertainty

Given mean and standard deviation, we can generate random values according to the normal probability density function.

(samp_5 <- rnorm(5, mean = 67, sd = 4))
[1] 70.48869 61.00586 69.36986 67.77460 69.75400

We know that the \(\mu\) of this distribution was 67, because we told it to be. But the sample mean probably isn’t exactly 67.

mean(samp_5)
[1] 67.6786

The “noisiness” of our sample mean estimate will decrease as the sample size increases.

Simulation code
expand_grid(
 n = c(10, 100, 1000),
 sim = 1:1000
) |>
  mutate(
    .by = c(n, sim),
    mean = mean(
      rnorm(n, mean = 67, sd = 4)
    )
  )->
  sample_simulations
1. Set up to do 1,000 simulations of sample sizes 10, 100, and 1,000.
2. For each sample size & simulation…
3. …generate n random values from the normal distribution, and get the mean.

Why?

\[ \frac{1}{n}\sum_{i=1}^n x_i = \sum_{i=1}^n\frac{x_i}{n} \]

  • As \(n \rightarrow\infty\), the contribution of each individual \(x_i\) to the sum decreases.

  • In small samples, occasional large values on one side of the mean are less likely to be counterbalanced by large values on the other side.

In smaller samples, each individual value has a larger influence on the mean estimate than in larger samples. We can look at this with the “Leave One Out” method, where we calculate

  • The mean of the sample

  • The mean of the sample with one value randomly left out

LOO code
expand_grid(
 n = c(10L, 100L, 1000L),
 sim = 1:1000
) |> 
  reframe(
    .by = c(n, sim),
    x = rnorm(n, mean = 67, sd = 4)
  ) |> 
  summarise(
    .by = c(n, sim),
    mean = mean(x),
    mean_loo = mean(sample(x, n()-1))
  ) |> 
  summarise(
    .by = n,
    low = min(mean-mean_loo),
    high = max(mean-mean_loo)
  )->
  loo_tbl
LOO effect on the mean (samples drawn from a normal(67, 4) distribution)

    n     low   high
   10   −1.36   1.45
  100   −0.12   0.13
1,000   −0.02   0.01

“Standard Error”

The “Standard Error of the Mean” is a metric of the uncertainty, or the instability, of our estimate of the mean. It’s defined as:

\[ \frac{s}{\sqrt{n}} \]

  • As the standard deviation increases, so does the standard error.

  • As the sample size increases, the standard error decreases.

Standard Error Code
tibble(
  n = c(10, 100, 1000)
) |> 
  reframe(
    .by = n,
    x = rnorm(n, mean = 67, sd = 4)
  ) |> 
  summarise(
    .by = n,
    mean = mean(x),
    sd = sd(x),
    se = sd/sqrt(n())
  ) ->
  se_tbl
    n   mean    sd    se
   10   66.5   2.8   0.9
  100   67.7   4.3   0.4
1,000   67.1   4.1   0.1

A real example

Let’s grab the F1 data from one speaker in the Peterson & Barney data set.

data("pb52", package = "phonTools")

pb52 |> 
  filter(
    speaker == 3
  ) ->
  speaker3
speaker3 |> 
  summarise(
    mean = mean(f1),
    sd = sd(f1),
    se = sd/sqrt(n())
  )->
  params3
 mean      sd     se
497.3   135.5   30.3

One good way to get a sense of how the Standard Deviation and the Standard Error relate to the mean is to overlay the range of mean±2sd and mean±2se on the plot (see the sketch after the list below).

  • The ±2sd range marks out where we expect the vast majority of the data to appear.

  • The ±2se range marks out where we expect estimates of the mean to appear.
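Here is a sketch of one way such an overlay could be drawn, using the speaker3 and params3 objects from above (the figure in these notes may have been made differently):

ggplot(speaker3, aes(x = f1)) +
  # the individual F1 measurements
  geom_histogram(bins = 10) +
  # mean ± 2sd: where we expect most of the data to fall
  annotate(
    "rect",
    xmin = params3$mean - (2 * params3$sd),
    xmax = params3$mean + (2 * params3$sd),
    ymin = -Inf, ymax = Inf,
    alpha = 0.2
  ) +
  # mean ± 2se: where we expect estimates of the mean to fall
  annotate(
    "rect",
    xmin = params3$mean - (2 * params3$se),
    xmax = params3$mean + (2 * params3$se),
    ymin = -Inf, ymax = Inf,
    alpha = 0.4
  ) +
  geom_vline(xintercept = params3$mean)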

While ±2se is useful for seeing the lower and upper bounds of the Standard Error distribution, we can also plot the full probability density function on top of the speaker’s data. Here are two different ways of visually presenting that information.

One way to think about this is that even though we have 1 mean and standard deviation estimate for the data from this speaker, there are a variety of normal distributions, some more probable than others, from which our data could have been sampled.

Sampling and sampling

Another thing to think about is how we’ve looked at one speaker’s data drawn from a larger data set of many speakers. Each individual speaker’s data could be summarized by taking the mean of their data, like in this example of speakers 3 and 60.

If we summarize all of the speakers’ data this way, and plot these means, we’ll see another distribution form.
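A sketch of what that summary could look like in code, using the same pb52 columns as above (the plot in these notes may have been made differently):

# one mean F1 per speaker
pb52 |>
  summarise(
    .by = speaker,
    mean_f1 = mean(f1)
  ) ->
  speaker_means

# the distribution of speaker means
ggplot(speaker_means, aes(x = mean_f1)) +
  geom_histogram(bins = 15)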

We can think of this as a distribution over speakers, from which individual speaker means are drawn.

Cascading uncertainties

Remember that for each individual speaker, we have some uncertainty about their actual \(\mu\). That uncertainty should cascade upwards a little bit to uncertainty about the distribution over speakers.

But at the same time, if we think speakers’ means are drawn from an overarching distribution, that should help reduce our uncertainties about any one speaker’s \(\mu\), because it should be a plausible value drawn from the population.

Being able to fit models that capture the trading back and forth of uncertainties in nested data is one of the main goals of this course.

Bringing it all together into one big picture:

  • We have a variety of distributions from which individual speakers’ \(\mu\) are drawn.

  • In turn, we also have a variety of distributions from which individual speakers’ data was drawn.

Reuse

CC-BY 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {What Is a Mean?},
  date = {2024-09-25},
  url = {https://lin611-2024.github.io/notes/meetings/2024-09-25_mean-sd.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “What Is a Mean?” September 25, 2024. https://lin611-2024.github.io/notes/meetings/2024-09-25_mean-sd.html.