


✍️ Describe the relationship between illiteracy rate and life expectancy shown in the scatterplot above.
🗣 Describe the relationship between high school graduation rate and income using the scatterplot above.
When discussing bivariate relationships, it is common to treat one variable as the explanatory and one as the response.
| Explanatory Variable | Response Variable |
|---|---|
| | |
| | |
Answer Question 1 in the classroom tab on Top Hat using the scatterplot below.
The correlation coefficient, \(R\), measures the strength and direction of the linear association between two quantitative variables.
The correlation between two quantitative variables will always be a value between -1 and 1.
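As a minimal sketch of these properties, the code below computes a correlation in R. It assumes base R's built-in `state.x77` data (all 50 states) as a stand-in for the course dataset, so the value it produces is not the one from the slides.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix,
# NOT the state_30 file from Canvas
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Correlation coefficient between the two quantitative variables
r <- cor(states$Illiteracy, states$LifeExp)
r

# The correlation is the same whichever variable is listed first ...
r_swapped <- cor(states$LifeExp, states$Illiteracy)

# ... and it does not change when the units of measurement change
r_rescaled <- cor(states$Illiteracy / 100, states$LifeExp * 12)

c(r = r, swapped = r_swapped, rescaled = r_rescaled)
```

All three values agree, and each lies between -1 and 1.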

Answer Question 2 in the classroom tab on Top Hat using the scatterplots below.
Simple linear regression is the statistical method for fitting a line to describe the relationship between two quantitative variables.
We want to find a line of the form \(\hat{y} = b_0 + b_1 x\)

What characteristics would the “line of best fit” have?
Open the link: https://beav.es/cTp (also found in the Week 9 module on Canvas called “SLR Demo”)
Try to find the line that best fits the data by adjusting the sliders below \(b_0\) and \(b_1\).
🗣 Compare your values of \(b_0\) and \(b_1\) with someone nearby. Discuss how you chose the values of \(b_0\) and \(b_1\).
The residual of an observation is the difference between the observed response, \(y_i\), and the predicted response based on the model fit, \(\hat{y}_i\).
\(e_i = y_i - \hat{y}_i\)
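To make the definition concrete, here is a small sketch that computes residuals for a hand-picked line, much like one chosen with sliders. The values `b0 = 72` and `b1 = -1` are hypothetical, and the built-in `state.x77` data is assumed as a stand-in for the course dataset.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Hypothetical candidate line (not a fitted line)
b0 <- 72
b1 <- -1

# Predicted response for each observation under the candidate line
y_hat <- b0 + b1 * states$Illiteracy

# Residual = observed response - predicted response
e <- states$LifeExp - y_hat

head(data.frame(observed = states$LifeExp, predicted = y_hat, residual = e))
```

A positive residual means the point lies above the line; a negative residual means it lies below.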

The least squares regression line (LSRL) is calculated by finding the line that minimizes the sum of the squared residuals.
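The minimization idea can be sketched numerically. The code below defines the sum of squared residuals as a function of \(b_0\) and \(b_1\), minimizes it with a general-purpose optimizer, and compares the result with `lm()`. The helper `sse()` and the starting guess are illustrative choices, and `state.x77` is assumed as stand-in data.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Sum of squared residuals for a line with intercept b[1] and slope b[2]
sse <- function(b) {
  sum((states$LifeExp - (b[1] + b[2] * states$Illiteracy))^2)
}

# Minimize numerically, starting from a rough guess
fit_optim <- optim(par = c(70, 0), fn = sse, method = "BFGS")

# lm() solves the same least squares problem directly
fit_lm <- lm(LifeExp ~ Illiteracy, data = states)

fit_optim$par   # numerical minimizer
coef(fit_lm)    # least squares estimates

# Any other line, such as a hand-picked one, has a larger sum of squares
sse(coef(fit_lm))
sse(c(72, -1))
```

The two sets of estimates should agree up to the optimizer's numerical tolerance.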
When fitting the LSRL, we generally require:
Linearity - the data should indicate a linear trend
Nearly normal residuals - the residuals should be approximately normally distributed
Constant variability - the variability of the points around the line should be roughly constant
Independent observations
If the above conditions are met, we can fit the LSRL using the following estimates: \(b_1 = \frac{s_y}{s_x}R\) and \(b_0 = \overline{y} - b_1\overline{x}\).
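As a quick check of these formulas, the sketch below computes \(b_1\) and \(b_0\) from summary statistics and compares them with the output of `lm()`. It assumes the built-in `state.x77` data as a stand-in, so the estimates differ from those on the slides.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

x <- states$Illiteracy  # explanatory variable
y <- states$LifeExp     # response variable

# Slope: ratio of standard deviations times the correlation
r  <- cor(x, y)
b1 <- sd(y) / sd(x) * r

# Intercept: forces the line through the point (x-bar, y-bar)
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)

# Same estimates from lm()
coef(lm(LifeExp ~ Illiteracy, data = states))
```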
In practice, we compute these estimates using R. Coming up…
\[ \hat{y} = b_0 + b_1 x\]
Interpreting the intercept estimate, \(b_0\): the expected value of the response variable when the explanatory variable is equal to 0.
Interpreting the slope estimate, \(b_1\): for a one-unit increase in the explanatory variable, we expect the response to change by \(b_1\), on average.
Answer Class 18 - Question 1 in the classroom tab on Top Hat using the fitted LSRL.
\[\hat{y} = 72.181 - 1.146x\] where \(\hat{y}\) is the predicted average life expectancy and \(x\) represents illiteracy rate.
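The fitted equation can be used directly for predictions. The sketch below plugs in two hypothetical illiteracy rates (1% and 2%, chosen only for illustration) and shows that predictions one unit apart differ by exactly the slope. The function name `predict_life_exp()` is an illustrative helper, not part of the course code.

```r
# Coefficients taken from the fitted LSRL on the slide
b0 <- 72.181
b1 <- -1.146

# Predicted average life expectancy (years) for a given illiteracy rate (%)
predict_life_exp <- function(illiteracy) {
  b0 + b1 * illiteracy
}

# Hypothetical illiteracy rates, for illustration only
pred_1 <- predict_life_exp(1)
pred_2 <- predict_life_exp(2)
c(pred_1, pred_2)

# A one-unit increase in x changes the prediction by the slope b1
pred_2 - pred_1

# At x = 0 the prediction equals the intercept b0
predict_life_exp(0)
```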
Answer Class 18 - Question 2 in the classroom tab on Top Hat using the fitted LSRL.
\[\hat{y} = 72.181 - 1.146x\] where \(\hat{y}\) is the predicted average life expectancy and \(x\) represents illiteracy rate.
Recall that the residual is the difference between the observed response and the predicted response based on the model fit: \[e_i = y_i - \hat{y}_i\]
Answer Class 18 - Question 3 in the classroom tab on Top Hat using the fitted LSRL.
\[\hat{y} = 72.181 - 1.146x\] where \(\hat{y}\) is the predicted average life expectancy and \(x\) represents illiteracy rate.
Recall that to fit the LSRL, we need four conditions to hold (see Least Squares Regression Line slide).
Some of these conditions can be easily checked using a residual plot.

Ideally, when fitting the LSRL, we see no obvious patterns in the residual plot.
If a pattern is visible, it might be an indication that one or more of the LSRL conditions are violated.
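One way to draw a residual plot is sketched below. It uses base R graphics rather than ggplot2 to keep the example dependency-free, and it assumes the built-in `state.x77` data as a stand-in for the course dataset.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Fit the LSRL and extract the residuals
fit <- lm(LifeExp ~ Illiteracy, data = states)
e   <- resid(fit)

# Residuals against the explanatory variable
plot(states$Illiteracy, e,
     xlab = "Illiteracy Rate (% of population)",
     ylab = "Residual (years)",
     main = "Residual Plot")

# Reference line at zero: points should scatter evenly around it
abline(h = 0, lty = 2)
```

Curvature in this plot would suggest a problem with linearity, and a fan shape would suggest non-constant variability.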


Linearity violated
Nearly normal residuals violated
Constant variability violated
Independence violated
Answer Class 18 - Question 4 in the classroom tab on Top Hat using the fitted LSRL and corresponding residual plot.
\[\hat{y} = 72.181 - 1.146x\] where \(\hat{y}\) is the predicted average life expectancy and \(x\) represents illiteracy rate.
# Load the tidyverse packages
library(tidyverse)

# Import the dataset (download the data file from Canvas first)
state_30 <- read_csv(file.choose())

# Create a scatterplot of the Illiteracy and LifeExp variables
ggplot(data = state_30, aes(x = Illiteracy, y = LifeExp)) +
  geom_point(color = "purple", size = 3) +
  labs(y = "Life Expectancy (years)",
       x = "Illiteracy Rate (% of population)",
       title = "Illiteracy Rate vs. Life Expectancy",
       subtitle = "for 30 US States in 1970") +
  # Apply theme_bw() first; a complete theme added later would reset theme() tweaks
  theme_bw() +
  theme(axis.title = element_text(size = 18)) +
  # Overlay the least squares regression line without a confidence band
  stat_smooth(method = "lm",
              formula = y ~ x,
              geom = "smooth",
              se = FALSE,
              color = "darkgreen")

# Calculate the correlation between illiteracy rate and life expectancy
state_30 %>% summarise(cor = cor(Illiteracy, LifeExp))

# Estimate the intercept and slope for the LSRL
lm(LifeExp ~ Illiteracy, data = state_30)
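
Storing the fitted model in an object makes it easier to reuse. The sketch below saves the fit, extracts the estimates with `coef()`, and predicts for new values with `predict()`. It assumes the built-in `state.x77` data as a stand-in for `state_30`, and the illiteracy rates in `new_states` are hypothetical.

```r
# Stand-in data (assumption): base R's built-in state.x77 matrix
states <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"]
)

# Save the fitted model instead of only printing it
fit <- lm(LifeExp ~ Illiteracy, data = states)

# Named vector holding the intercept and slope estimates
coef(fit)

# Predictions for hypothetical illiteracy rates of 1% and 2%
new_states  <- data.frame(Illiteracy = c(1, 2))
predictions <- predict(fit, newdata = new_states)
predictions
```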