Topic 5 - Confidence Intervals

Example

From Class 8 Activity

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

The information above describes the outcome of a single sample. Suppose the true proportion of defective chips at this factory is \(p=0.1\).

Sampling distribution of \(\hat{p}\): The distribution of the sample proportion of defective chips from a sample of size 212 is approximately Normal with a mean of \(0.1\) and a standard error equal to \(\sqrt{\frac{0.1(1-0.1)}{212}} = 0.021\).

Calculate the probability of collecting a sample of size \(n=212\) and observing a sample proportion of defective chips of 0.127 or less. That is, calculate \(P(\hat{p}<0.127)\).

pnorm(0.127, 0.1, 0.021) = 0.901

The probability of observing a sample proportion of defective components less than 0.127 is 0.901.

Calculate the probability of observing a sample proportion of defective chips that is within two standard errors of the true population proportion, \(p\).

\(P(0.058 < \hat{p} < 0.142) =\)

pnorm(0.142, 0.1, 0.021) - pnorm(0.058, 0.1, 0.021) = 0.954

Confidence Intervals of a Proportion

Point Estimate

The single value considered to be the “best guess” for the parameter of interest.

 

For example, if the parameter of interest is a population proportion, \(p\), the point estimate from a sample is the sample proportion, \(\hat{p}\).

Sampling Variability

A point estimate is rarely exactly equal to the population parameter and has some sampling variability associated with it.

 

The Central Limit Theorem can help us understand this variability for \(\hat{p}\) and \(\overline{x}.\)

Confidence Intervals

A confidence interval considers the sampling variability of the point estimate to provide a range of plausible values for the parameter of interest.

Sampling Distribution of \(\hat{p}\)

Central Limit Theorem for \(\hat{p}\): For a sufficiently large sample, \[ \hat{p} \sim N\bigg(p, \sqrt{\frac{p(1-p)}{n}}\bigg)\]

Standardizing the Sampling Distribution of \(\hat{p}\)

Recall that we can standardize any Normal random variable by subtracting its mean and dividing by its standard deviation: \[ \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \sim \]

95% Confidence Interval

If the sampling distribution of \(\hat{p}\) is normal, then 95% of all possible point estimates will be within \(z^*\) standard errors of the population proportion \(p\).

How can we determine the value of \(z^*\)?

Sampling Distribution of \(\hat{p}\)

Standard Normal Distribution

Standard Normal Distribution

The qnorm() function in R is the inverse of pnorm(). We can provide the area under the Normal distribution curve to the left of the value on the distribution that we’re looking for.

\(z^*=\) qnorm(0.975, mean = 0, sd = 1)\(= 1.96\)

95% Confidence Interval

If the sampling distribution of \(\hat{p}\) is normal, then 95% of all possible point estimates will be within 1.96 standard errors of the population proportion \(p\).

95% Confidence Interval

If a point estimate, \(\hat{p}\) within 1.96 standard errors of \(p\) is selected, then the interval

\[ \hat{p} \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]

will capture the parameter \(p\).

 

We can be 95% confident that the parameter is captured by the above interval estimate.

 

5% of these “95% confidence intervals” won’t capture \(p\). In practice, we don’t know \(p\) (remember the goal here is to estimate it), so we won’t know if the interval we constructed is one of the “good ones.”

Constructing a Confidence Interval

 

All confidence intervals have the same general form:

 

point estimate \(\pm\) critical value \(\times\) standard error of point estimate

 

critical value \(\times\) standard error of point estimate is often referred to as the margin of error

Constructing a Confidence Interval for \(p\)

  1. Calculate the point estimate, \(\hat{p}\), from the observed sample.

 

  1. Identify the confidence level, \(CL\), and the error associated with this confidence level \(100-CL\).

 

  1. Determine the critical value, \(z^*\), by finding the \(CL + \frac{100-CL}{2}\) percentile on the Standard Normal Distribution:

qnorm(\((CL + \frac{100-CL}{2})\times \frac{1}{100}\), mean = 0, sd = 1)

 

  1. Calculate the standard error estimate from the observed sample: \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Constructing a Confidence Interval for \(p\)

 

 

\[\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

Check the sample size

Recall that the construction of the confidence interval was based on the Central Limit Theorem for the sample proportion. The CLT requires that we have sufficiently large sample sizes, so we should only construct a confidence interval for \(p\) if

  1. \(n\hat{p} \geq 10\)

AND

  1. \(n(1-\hat{p}) \geq 10\)

Choosing a Confidence Level

The confidence level is chosen based on whether precision or accuracy is more desirable.

As we increase the confidence level, we lose precision but gain accuracy.

Interpreting a Confidence Interval

 

Template: We are ______% confident that the population _____________ of ____________________ is between __________ and __________, with a point estimate of __________.

  • Interpretation is always about the population parameter.

  • The confidence interval should not be interpreted as the probability of observing the parameter.

  • The confidence interval does not provide an interpretation corresponding to a single observation or single point estimate.

  • The confidence interval simply provides a range of plausible values for the parameter.

Example - Please open Class 9 Activity in TH 🎩

In a 2023 survey conducted by the Pew Research Center, 3,925 U.S. adults, out of the 10,329 randomly sampled, say they are very to somewhat likely to seriously consider an electric vehicle (EV) for their next vehicle purchase. We’ll consider these individuals “interested in an EV.”

Construct a 90% confidence interval for the true proportion of U.S. adults interested in an EV. Interpret the interval with context.

  1. \(\hat{p} = \frac{3925}{10329}\)

  2. CL = 90%, 100-CL = 10%

  3. \(z^*=\) qnorm(0.95, 0, 1) \(=1.645\)

  4. Standard error: \(\sqrt{\frac{\frac{3925}{10329}\bigg(1-\frac{3925}{10329}\bigg)}{10329}}\)

\[\frac{3925}{10329} \pm 1.645\sqrt{\frac{\frac{3925}{10329}(1-\frac{3925}{10329})}{10329}} = (0.372, 0.388)\]

Interpretation:

We are 90% confident that the population proportion of U.S. adults interested in an EV is between 0.372 and 0.388, with a point estimate of 0.38.

Confidence Intervals of a Mean

Motivating Example

 

Question of interest: What is the average mile time for all racers in the 10 Mile event in the 2017 Cherry Blossom Run?

 

Parameter of interest: \(\mu\), the average mile time for all racers in the 10-mile event

Example continued - Sampled Data

\(\overline{x} = 591.53\) seconds

\(\overline{x}\) is the point estimate for the average mile time for all runners in the 10 Mile event.

Example continued - Quantifying the Uncertainy in \(\overline{x}\)

  • The point estimate, \(\overline{x} = 591.53\) is our best guess for the population mean, \(\mu\); however, since this estimate came from a subset of the population, we need to address the possible error or uncertainty in this estimate.

  • To do this, we need to identify the sampling distribution of \(\overline{x}\).

  • The sample size is sufficiently large (\(n=45\)), so according to the central limit theorem, what is the sampling distribution of the sample mean, \(\overline{x}\)?

\(\overline{x} \sim N\)\(\bigg(\)\(\mu\),\(\frac{\sigma}{\sqrt{n}}\)\(\bigg)\)

Interlude - Standardizing the point estimate

 

\(\overline{x} \sim N\bigg(\mu, \frac{\sigma}{\sqrt{n}}\bigg)\)

 

\(z = \frac{\overline{x} - \mu}{\sigma/\sqrt{n}} \sim\)\(N(0,1)\)

t Distributions

 

  • When the population standard deviation, \(\sigma\) is unknown, we can use the sample standard deviation, \(s\), in its place; however, the standardized sampling distribution of \(\frac{\overline{x}-\mu}{s/\sqrt{n}}\) is not normal.

 

  • When \(n\) is sufficiently large, \(\frac{\overline{x}-\mu}{s/\sqrt{n}}\) follows a t distribution.

t Distribution vs. Standard Normal

t Distribution vs. Standard Normal

t Distributions

  • A t distribution is symmetric and centered at 0.

  • A t distribution is defined by its degrees of freedom.

  • The degrees of freedom determine the spread of the distribution.

  • As the degrees of freedom of increase toward \(\infty\), the t distribution approaches the Standard Normal distribution.

  • Degrees of freedom are determined by the sample size and number of unknown parameters.

Sampling Distribution of the Standardized Sample Mean

 

For sufficiently large \(n\), the distiribution of the standardized sample mean, \(\frac{\overline{x}-\mu}{s/\sqrt{n}}\) follows a t distribution with \(n-1\) degrees of freedom.

\[\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{n-1} \]

Check the sample size

Recall that the construction of the confidence interval was based on the Central Limit Theorem for the sample mean. The CLT requires that we have sufficiently large sample sizes.

Use the sample size and observe the shape of the sampled distribution to determine if the sample size is sufficiently large:

  • If \(n\geq 30\), we can typically assume the sampling distribution of \(\overline{x}\) is approximately Normal and the CLT applies.

  • If \(n < 30\), we need to look at the sampled distribution. If there are no clear outliers or strong skewness in the sampled data, we can assume the sampling distribution of \(\overline{x}\) is approximately Normal and the CLT applies.

Confidence Interval for \(\mu\)

A confidence interval for \(\mu\) should be constructed when we want to estimate the population mean \(\mu\) based on an observed sample of \(n\) observations.

Constructing a Confidence Interval for \(\mu\)

  1. Calculate the point estimate, \(\overline{x}\), from the observed sample.

  2. Identify the confidence level, \(CL\), and the error associated with this confidence level \(100-CL\).

  3. Determine the critical value, \(t^*\), by finding the \(CL + \frac{100-CL}{2}\) percentile on the t distribution with \(n-1\) degrees of freedom:

qt(\((CL + \frac{100-CL}{2})\times \frac{1}{100}\), n-1)

 

  1. Calculate the standard error estimate from the observed sample: \(\frac{s}{\sqrt{n}}\)

Confidence Interval Construction for \(\mu\)  

 

\[\overline{x}\pm t^* \bigg( \frac{s}{\sqrt{n}}\bigg)\]

Finding t Critical Values

  1. Determine the confidence level, \(CL\)
  2. The critical value, \(t^*\), is found by computing the \(CL + \frac{100-CL}{2}\) percentile on a t distribution with \(n-1\) degrees of freedom.
  • For a 90% confidence interval, the critical value \(t^*\) is equal to ____________ percentile on a t distribution with \(n-1\) degrees of freedom.
  • For a 95% confidence interval, the critical value \(t^*\) is equal to ____________ percentile on a t distribution with \(n-1\) degrees of freedom.
  • For a 99% confidence interval, the critical value \(t^*\) is equal to ____________ percentile on a t distribution with \(n-1\) degrees of freedom.

Finding t Critical Values using qt()

 

qt(p, df) calculates the value on a t distribution curve with df degrees of freedom that has an area of p to the left of it.

 

Example: To find the critical value needed to construct a 90% confidence interval for the mean when \(n=85\), run the following R code:

qt(0.95, 84)

Example continued

Using the sampled data, construct a 95% confidence interval for the average mile time for all runners in the 10 Mile event at the 2017 Cherry Blossom Run.

\(\overline{x} = 591.53\) seconds

\(s = 105.66\) seconds

\(n = 45\)

Example continued

Using the sampled data, construct a 95% confidence interval for the average mile time for all runners in the 10 Mile event at the 2017 Cherry Blossom Run.

\(\overline{x} = 591.53\) seconds

\(s = 105.66\) seconds

\(n = 45\)

  1. Point estimate: 591.53

  2. Confidence level: 95%

  3. Critical value \(t^*\)=qt(0.975, 44)=2.015

  4. Standard error: \(105.66/\sqrt{45}\)

Confidence Interval: \[591.53 \pm 2.015 \bigg(\frac{105.66}{\sqrt{45}}\bigg) = (559.79, 623.27)\]

Interpretation:

We are 95% confident that the average mile time for all runners in the 10 Mile event at the 2017 Cherry Blossom Run was between 559.79 seconds and 623.27 seconds with a point estimate of 591.53 seconds.

Example 2

A random sample of 24 possums in Australia and New Guinea was collected. The length of each sampled possum’s head was measured in millimeters. The sampled distribution and sample statistics of the head lengths is shown below.

 

 

\(\overline{x} = 93.175\) mm \(s = 1.808\) mm

Use the sampled data to construct a 99% confidence interval for the true average head length of all possums in Australia and New Guinea. Don’t forget to check the sample size conditions before constructing the interval.

Example 2

Check sample size condition! The sample size is less than 30; however, the sampled distribution does not suggest any skewness or outliers, so we can still construct confidence interval with the smaller sample size.

  1. Point estimate: 93.175

  2. Confidence level: 99%

  3. Critical value \(t^*\)=qt(0.995, 23)=2.807

  4. Standard error: \(1.808/\sqrt{24}\)

Confidence Interval: \[93.175 \pm 2.807 \bigg(\frac{1.808}{\sqrt{24}}\bigg) = (92.139, 94.211)\]

Interpretation:

We are 99% confident that the average head length of all possums in Australia and New Guinea is between 92.139 mm and 94.211 mm with a point estimate of 93.175 mm.

Practice!

Please complete the End of Class 10 Activity in Top Hat (can be found under the Assigned tab). Collaboration is encouraged!