Topic 4 - The Normal Distribution and Sampling Variability

Download or print notes to PDF

If you’d like to export this presentation to a PDF, do the following

  1. Toggle into Print View using the E key.
  2. Open the in-browser print dialog (CTRL/CMD+P)
  3. Change the Destination to Save as PDF.
  4. Change the Layout to Landscape.
  5. Change the Margins to None.
  6. Enable the Background graphics option.
  7. Click Save.

This feature has been confirmed to work in Google Chrome and Firefox.

The Normal Distribution

Let’s start with an example

In a study on driving reaction time, researchers record the time (in seconds) that it takes study participants to react to the brake lights of a vehicle traveling in front of them.

A histogram of the recorded data is shown below.

Describe the shape, center, and spread of the distribution.

01:00

Normal Distribution

  • When to use:

Useful when modeling a continuous random variable that has a bell-shaped distribution.

  • Parameters of the distribution:

\(\mu\): mean - determines the center of the distribution

\(\sigma\): standard deviation - determines the spread of the distribution

  • Probability Density Function:

\[f(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] for \(x\) in \((-\infty, \infty)\)

 

  • Expectation: \(E(X) = \mu\)

  • Variance:\(Var(X) = \sigma^2\)

Example: Driving Reaction Time

Standard Normal Distribution

The Standard Normal Distribution is a Normal distribution with mean \(\mu=0\) and standard deviation \(\sigma=1\).

\[Z \sim N(0, 1)\]

Standardizing Any Normal Random Variable

For a Normal random variable, \(X\), a z-score represents the number of standard deviations any observation \(x\) is from the mean.

\[z = \frac{x-\mu}{\sigma}\]

Example: Driving Reaction Time

The driving reaction time study suggests that reaction times to break lights are Normally distributed with a mean of 1.25 seconds and standard deviation of 0.46.

Consider a driver who reacted in 0.79 seconds. How many standard deviations away from the mean is this driver’s reaction time?

Calculate the z-score!

00:45

\(z=\)\(\frac{x-\mu}{\sigma}\)\(=\frac{0.79-1.25}{0.46}\)\(=-1\)

 

The driver’s reaction time is 1 standard deviation less than (i.e., faster than) the average driver in the study.

Example: Driving Reaction Time

The driving reaction time study suggests that reaction times to break lights are Normally distributed with a mean of 1.25 seconds and standard deviation of 0.46.

What is the probability of a randomly selecting a driver with a reaction time less than 1 second?

\[P(X < 1) = \int \limits_{-\infty}^{1} \frac{1}{\sqrt{2\pi (0.46^2)}}e^{-\frac{(x-1.25)^2}{2(0.46^2)}} dx\]

 

We cannot solve this analytically - we must use R!

R Demonstration

 

Normal Distribution

 

\(F(x) = P(X \leq x)\):

pnorm(q, mean, sd, lower.tail = TRUE)

 

\(p^{th}\) percentile:

qnorm(p, mean, sd, lower.tail = TRUE)

Sampling Distributions

Sampling Distributions Simulation

Please do the following:

Answer the one question in the google form that can be accessed in any of the following ways:

  • Typing the following URL into your browser. The URL is case sensitive. https://beav.es/W4_survey

  • Find the Week 4 Survey link on Canvas in the Quick Links module

  • Scan the QR code

Key Concepts

Inferential Statistics

  • Recall that inferential statistics use information from a sample to estimate or test characteristics from a population of interest.

  • Typically, we calculate a point estimate from the sample as our best guess of the parameter of interest.

    • Naturally, our best guess for the population mean, \(\mu\), from a sample is the sample mean, \(\overline{x}\).

    • Our best guess for the population proportion, \(p\), is the sample proportion, \(\hat{p}\).

Example

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

The information above describes the outcome of a single sample. Suppose the true proportion of defective chips at this factory is \(p=0.1\).

Participation Question 📊

The value \(p=0.1\) is the

a. population parameter

b. sample statistic

Answer the question at

PollEv.com/erinhowardstats

Sampling Variability

Even when robust sampling schemes are used, different samples will yield different point estimates.

Population

Sample 1

Sample 2

Sample 3

\(\hat{\theta}_1\) \(\hat{\theta}_2\) \(\hat{\theta}_3\) \(\hat{\theta}\) represents a generic point estimate.

Distributions of Inference

Population Distribution

Distribution of the entire collection of interest.

SamplED Distribution

Distribution of \(n\) observations obtained from a single sample.

SamplING Distribution

Distribution of a sample statistic, such as \(\overline{x}\) or \(\hat{p}\), from repeated samples of size \(n\) from the population.

Example

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

The information above describes the outcome of a single sample. Suppose the true proportion of defective chips at this factory is \(p=0.1\).

Participation Question 📊

Suppose we were able to collect every possible sample of size 212 from the population of chips produced.

The distribution of the sample proportions of defective chips is the

a. population distribution.

b. sampled distribution.

c. sampling distribution.

Answer the question at

PollEv.com/erinhowardstats

Sampling Distributions

 

If we can’t observe the sampling distribution in real-world applications, why do we care about it?

 

Understanding the sampling distribution of commonly used statistics, such as \(\overline{x}\) and \(\hat{p}\), allows us to quantify the uncertainty in our point estimates.

Sample Size & Sampling Variability

  • The variability of the point estimate is called the standard error.

  • The standard error is the standard deviation of the sampling distribution.

  • As \(n\) increases, the standard error of the point estimate decreases.

Central Limit Theorem

When observations are independent and the sample size, \(n\), is sufficiently large, the central limit theorem states that the distributions of \(\hat{p}\) and \(\overline{x}\) are approximately Normal.

The sample size conditions (“sufficiently large”) and the details of these normal distributions differ for \(\hat{p}\) and \(\overline{x}\).

Sample Proportion, \(\hat{p}\)

\[\hat{p}\sim N\bigg(p, \sqrt{\frac{p(1-p)}{n}}\bigg)\] where \(p\) represents the population proportion

Sample Mean, \(\overline{x}\)

\[\overline{x}\sim N\bigg(\mu, \frac{\sigma}{\sqrt{n}}\bigg)\] where \(\mu\) and \(\sigma\) represent the population mean and standard deviation, respectively.

Central Limit Theorem

Sampling Distribution of Sample Proportion, \(\hat{p}\)

Central Limit Theorem

Sampling Distribution of Sample Mean, \(\overline{x}\)

Example

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

The information above describes the outcome of a single sample. Suppose the true proportion of defective chips at this factory is \(p=0.1\).

Participation Question 📊

Suppose we were able to collect every possible sample of size 212 from the population of chips produced.

The distribution of the sample proportion of defective chips is

Answer the question at

PollEv.com/erinhowardstats

CLT - Sample Size Conditions

The sample size conditions needed to apply the Central Limit Theorem differ depending on the statistic.

Sample proportion, \(\hat{p}\)

For the CLT to apply to the distribution of the sample proportion, we need the following sample size conditions to be met:

  • \(np \geq 10\)

  • \(n(1-p) \geq 10\)

Sample mean, \(\overline{x}\)

Use the sample size and observe the shape of the sampled distribution to determine if the sample size is sufficiently large:

  • If \(n\geq 30\), we can typically assume the sampling distribution of \(\overline{x}\) is approximately Normal and the CLT applies.

  • If \(n < 30\), we need to look at the sampled distribution. If there are no clear outliers or strong skewness in the sampled data, we can assume the sampling distribution of \(\overline{x}\) is approximately Normal and the CLT applies.

If the sample size conditions aren’t met, we cannot apply the results of the CLT.

Example

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

The information above describes the outcome of a single sample. Suppose the true proportion of defective chips at this factory is \(p=0.1\).

Participation Question 📊

In the previous participation question, we stated that the distribution of \(\hat{p}\) was Normal, but we didn’t check the sample size conditions before doing so! Are the sample size conditions met in this scenario in order to apply the central limit theorem?

Answer the question at

PollEv.com/erinhowardstats

More Practice!

 

Want more practice with this material? An optional activity is available. Find Practice with the Central Limit Theorem in the Week 4 module.