Topic 8: Analysis of Variance (ANOVA)

Motivating Exercises

In each of the side-by-side boxplots below, you'll see data sampled from three different populations. The red dot in each plot corresponds to the sample mean.

Example 1

📊 In Poll Everywhere, comment on whether you think the means of the populations from which the above three samples came are the same, similar, or significantly different from one another.

QR code to PollEv.com/erinhowardstats

Answer the question at PollEv.com/erinhowardstats

Motivating Exercises

In each of the side-by-side boxplots below, you'll see data sampled from three different populations. The red dot in each plot corresponds to the sample mean.

Example 2

📊 Again, in Poll Everywhere, comment on whether you think the means of the populations from which the above three samples came are the same, similar, or significantly different from one another.


Answer the question at PollEv.com/erinhowardstats

Analysis of Variance (ANOVA)

  • An ANOVA is a holistic procedure used to test whether there is evidence that at least one pair of populations have different means when comparing more than two populations.

  • Why is it called an Analysis of Variance if the goal is to compare means?

    • In an ANOVA, we'll compare the variability across groups to the variability within groups.

Notation

  • \(k=\) number of groups (i.e. number of populations of interest)

  • \(n=\) overall sample size (i.e. size of all samples combined)

  • \(\overline{x}=\) overall sample mean (i.e. mean of all observations ignoring groups)

  • \(n_i=\) sample size of the \(i\)th group

  • \(\mu_i=\) population mean of the \(i\)th group

  • \(\overline{x}_i=\) sample mean of the \(i\)th group

  • \(s_i=\) sample standard deviation of the \(i\)th group
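
The overall quantities can be recovered from the group-level ones. The relationships below are standard identities rather than something stated in these notes, and are included only to connect the symbols:

\[n = \sum \limits_{i=1}^k n_i, \qquad \overline{x} = \frac{1}{n}\sum \limits_{i=1}^k n_i\,\overline{x}_i\]

In other words, the overall sample mean is the average of the group sample means, weighted by sample size.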

ANOVA Procedure Steps

 

1. Question of interest

Are there differences in the means of the multiple populations?

 

2. Parameters of interest

\(\mu_1, \mu_2, ..., \mu_k\), the population means of the \(k\) groups

Example Problem 🚲 🚗 🚶‍♀️

Suppose we are interested in determining if there is a significant difference in the average number of hours of sleep students get per night based on whether they bike, drive, or walk to campus. We'll assess significance at the \(\alpha = 0.05\) significance level.

 

Question of interest: Is there a difference in the average number of hours of sleep students get per night based on whether they bike, drive, or walk to campus?

 

Parameters of interest:

\(\mu_B\), average amount of sleep for all students who bike

\(\mu_D\), average amount of sleep for all students who drive

\(\mu_W\), average amount of sleep for all students who walk

ANOVA Procedure Steps

 

3. Null and Alternative Hypotheses

 

\(H_0: \mu_1 = \mu_2 = ... = \mu_k\)

 

\(H_A:\) At least one mean is different

Participation Question 📊

 

Write the null hypothesis for our bike, drive, walk example.

 

Answer the question at PollEv.com/erinhowardstats


3. Null and Alternative Hypotheses

 

Example 🚲 🚗 🚶‍♀️

 

\(H_0: \mu_B = \mu_D = \mu_W\)

 

\(H_A:\) At least one transportation group gets a different number of hours of sleep per night, on average, than the others.

ANOVA Procedure Steps

4. The \(F\) test statistic

In order to answer our question of interest, we must compare the variability between groups to the variability within groups.

Mean Square Between Groups (MSG)

  • Measures the average variability between groups

\[MSG = \frac{1}{k-1}\sum \limits_{i=1}^k n_i (\overline{x}_i - \overline{x})^2\]

Mean Square Error (MSE)

  • Measures the average variability within groups

\[MSE = \frac{1}{n-k}\sum \limits_{i=1}^k (n_i-1)s_i^2 \]

The test statistic is the ratio of the average between-group variability to the average within-group variability: \[F = \frac{MSG}{MSE}\]
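
As a rough illustration of the three formulas above, here is a minimal R sketch that computes MSG, MSE, and F from group summary statistics. It plugs in the rounded bike, drive, and walk summary statistics from the example problem in these notes; the variable names are illustrative choices, not part of the course materials.

```r
# Rounded summary statistics from the bike/drive/walk example in these notes
xbar <- c(Bike = 6.93, Drive = 6.77, Walk = 6.87)  # group sample means (hours)
s    <- c(Bike = 1.40, Drive = 1.02, Walk = 1.12)  # group sample standard deviations
n_i  <- c(Bike = 33,   Drive = 37,   Walk = 180)   # group sample sizes

k <- length(n_i)  # number of groups
n <- sum(n_i)     # overall sample size

# Overall sample mean, computed as the size-weighted mean of the group means
xbar_all <- sum(n_i * xbar) / n

msg <- sum(n_i * (xbar - xbar_all)^2) / (k - 1)  # mean square between groups
mse <- sum((n_i - 1) * s^2) / (n - k)            # mean square error
f_stat <- msg / mse                              # F test statistic

print(round(c(MSG = msg, MSE = mse, F = f_stat), 3))
```

With these rounded inputs the sketch gives an MSG of about 0.24, an MSE of about 1.31, and an F of about 0.18. These are close to, but not identical to, the values in the example's ANOVA table, presumably because that table was computed from the unrounded raw data.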

ANOVA Procedure Steps

5. Determine the Null Distribution

There are two conditions that need to be met in order to assume the upcoming null distribution:

  1. The sample sizes must be sufficiently large.
    • If \(n_i\geq 30\) for every group, we can move forward.

    • If any of the sample sizes are less than 30, we need to look at the sampled distribution(s) of the small sample(s). If there are no clear outliers or strong skewness in the sampled data, we can move forward.

  2. We need to be able to assume constant variance across the groups.
    • To assess this condition, we should use the sampled distributions and make a judgement as to whether the standard deviations are similar between samples.

If the above conditions are met, under the null hypothesis, the test statistic, \(F = \frac{MSG}{MSE}\), follows an F distribution with \(k-1\) and \(n-k\) degrees of freedom.

ANOVA Procedure Steps

5. Determine the Null Distribution

If the previously mentioned conditions are met, under the null hypothesis, the test statistic, \(F = \frac{MSG}{MSE}\), follows an F distribution with \(k-1\) and \(n-k\) degrees of freedom.

  • The \(F\) distribution is right skewed, with support \((0, \infty)\)

  • Its shape is defined by two values:

    • The numerator degrees of freedom, \(k-1\)

    • The denominator degrees of freedom, \(n-k\)

  • Denoted: \(F_{k-1, n-k}\)
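
To get a feel for the shape of this distribution, the sketch below draws an \(F_{k-1, n-k}\) density in R and marks the value that cuts off the upper 5% of the area. It assumes \(k = 3\) and \(n = 250\), matching the bike, drive, and walk example; the variable names and plotting range are illustrative choices.

```r
k <- 3    # number of groups (assumed, matching the example problem)
n <- 250  # overall sample size (assumed, matching the example problem)

df1 <- k - 1  # numerator degrees of freedom
df2 <- n - k  # denominator degrees of freedom

# Value with 5% of the null distribution's area to its right
crit <- qf(0.95, df1, df2)
print(round(crit, 2))

# Density curve of the F distribution, with the cutoff marked by a dashed line
curve(df(x, df1, df2), from = 0, to = 5, xlab = "F", ylab = "Density")
abline(v = crit, lty = 2)
```

For these degrees of freedom the cutoff is roughly 3.03, so under the null hypothesis an F statistic larger than that would occur only about 5% of the time.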

Example Problem 🚲 🚗 🚶‍♀️

Data were collected on 250 randomly selected students. The summary statistics and sampled distributions for these data are shown below.

 

 

Group   Sample Mean   Sample Standard Deviation   Sample Size
Bike    6.93 hr       1.40 hr                     33
Drive   6.77 hr       1.02 hr                     37
Walk    6.87 hr       1.12 hr                     180

Three histograms showing the distributions of sleep per night for bike, drive, and walk samples. The bike and walk distributions are left skewed. The drive distribution is approximately symmetric.

Participation Question 📊

Assess the sample size condition needed to perform an ANOVA F test for our bike, drive, walk example.

 

Answer the question at PollEv.com/erinhowardstats


Three histograms showing the distributions of sleep per night for bike, drive, and walk samples. The bike and walk distributions are left skewed. The drive distribution is approximately symmetric.

Participation Question 📊

Consider the constant variance assumption needed to perform an ANOVA F test. Do you think the condition is met for our bike, drive, walk example? Why or why not?

 

Answer the question at PollEv.com/erinhowardstats


 

 

 

Group   Sample Mean   Sample Standard Deviation   Sample Size
Bike    6.93 hr       1.40 hr                     33
Drive   6.77 hr       1.02 hr                     37
Walk    6.87 hr       1.12 hr                     180
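
One way to check both conditions against the summary statistics above is sketched here in R. The cutoff of 2 for the ratio of the largest to the smallest standard deviation is a common rule of thumb assumed for illustration; these notes only ask for a judgement call based on the sampled distributions. The variable names are also illustrative.

```r
# Summary statistics from the table above
n_i <- c(Bike = 33,   Drive = 37,   Walk = 180)   # group sample sizes
s   <- c(Bike = 1.40, Drive = 1.02, Walk = 1.12)  # group sample standard deviations

# Condition 1: is every group's sample size at least 30?
all_large <- all(n_i >= 30)

# Condition 2: how different are the group standard deviations?
sd_ratio <- max(s) / min(s)

# Assumed rule of thumb (not from the notes): a ratio under 2 is treated as "similar"
similar_sds <- sd_ratio < 2

print(all_large)
print(round(sd_ratio, 2))
print(similar_sds)
```

Here every group has at least 30 observations, and the largest standard deviation is about 1.37 times the smallest, so under the assumed rule of thumb both conditions look reasonable.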

ANOVA Procedure Steps

 

6. Using the sampled data and the alternative hypothesis, determine what values would be considered “as or more extreme” than the observed sample statistic.

 

Any ratio of \(MSG\) to \(MSE\) greater than the calculated \(F\) test statistic would be considered “as or more extreme” than our observed data.

A number line that begins at zero on the left. A mark for F is indicated. Every value to the right of F on the number line is highlighted.

ANOVA Procedure Steps

7. Calculate the p-value

Recall that the p-value represents the probability of observing data as or more extreme than our current dataset according to the alternative hypothesis, if the null hypothesis were true.

An F distribution with the F statistic indicated. The area under the curve to the right of the F statistic is shaded.

In an ANOVA F test, the p-value is always the area under the null distribution curve to the right of the F test statistic.

R code: 1 - pf(F, k-1, n-k), or equivalently pf(F, k-1, n-k, lower.tail = FALSE)
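
As a check, the upper-tail area can be computed directly for the values reported in the example's ANOVA table in these notes (F = 0.171 with 2 and 247 degrees of freedom). The variable names are illustrative; `f_stat` is used rather than `F` because `F` is also R's shorthand for `FALSE`.

```r
f_stat <- 0.171  # F statistic reported in the example's ANOVA table
df1 <- 2         # numerator degrees of freedom, k - 1
df2 <- 247       # denominator degrees of freedom, n - k

# Area under the F distribution to the right of the observed statistic
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(round(p_value, 3))  # 0.843
```

This agrees with the p-value of 0.843 shown in the example's ANOVA table.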

Example Problem 🚲 🚗 🚶‍♀️

In practice, the ANOVA calculations (e.g., MSG, MSE, F statistic, p-value) won't be computed by hand. When we perform an ANOVA F test, we'll use R to complete the calculations.

 

Typically results are displayed in an ANOVA table.

            df    Sum of Squares   Mean Squares   F       p-value
Transport   2     0.450            0.225          0.171   0.843
Residuals   247   324.311          1.313          –       –
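
The raw data behind this table are not included in these notes, so the sketch below simulates a stand-in data set purely to show the kind of R call that produces an ANOVA table. The data frame `students` and its columns `transport` and `sleep` are assumed names, and the values are hypothetical draws from `rnorm()` based on the example's summary statistics, so the output will not match the table above.

```r
set.seed(1)  # make the simulated (hypothetical) data reproducible

# Hypothetical stand-in data: group sizes, means, and sds mimic the example
students <- data.frame(
  transport = factor(rep(c("Bike", "Drive", "Walk"), times = c(33, 37, 180))),
  sleep = c(rnorm(33,  mean = 6.93, sd = 1.40),
            rnorm(37,  mean = 6.77, sd = 1.02),
            rnorm(180, mean = 6.87, sd = 1.12))
)

# Fit the one-way ANOVA and extract the ANOVA table
fit <- aov(sleep ~ transport, data = students)
anova_table <- summary(fit)[[1]]

# Columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(anova_table)
```

The first row of the output corresponds to the grouping variable (the MSG row) and the second to the residuals (the MSE row), mirroring the layout of the table above.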

ANOVA Procedure Steps

8. Make a conclusion

Write a 2-part conclusion (we'll exclude a single point estimate and confidence interval). The conclusion should be written in the context of the problem and contain the following components:

  1. A statement of the strength of evidence in favor of the alternative hypothesis.

  2. Whether to reject or fail to reject the null hypothesis.

Example Problem 🚲 🚗 🚶‍♀️

Using the ANOVA output, write a conclusion for this test using an \(\alpha=0.05\) significance level.

            df    Sum of Squares   Mean Squares   F       p-value
Transport   2     0.450            0.225          0.171   0.843
Residuals   247   324.311          1.313          –       –

There is no evidence to suggest that the average amount of sleep students get per night differs based on how they travel to campus (biking, driving, or walking). At the 0.05 significance level, we fail to reject the null hypothesis that the average amount of sleep students get per night is equal across all three transportation types.
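
The reject or fail-to-reject part of this conclusion comes from comparing the p-value to \(\alpha\). A minimal R sketch of that comparison, using the p-value from the ANOVA table above and illustrative variable names:

```r
alpha   <- 0.05   # significance level stated in the example
p_value <- 0.843  # p-value from the ANOVA table above

# Reject the null hypothesis only when the p-value falls below alpha
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"

print(decision)  # "fail to reject H0"
```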
