Topic 1 - Collecting & Exploring Data

Collecting Data

Population

The complete collection of subjects we are interested in learning about or making inferences about.

Example:

Parameter

A characteristic about the population, typically unknown or unobservable.

Example:

Sample

An observed subset of the population

Example:

Statistic

A characteristic about the observed sample

Example:

Statistical Inference

The process of using known sampled information to form a conclusion about unknown population characteristics.
Primarily concerned with understanding and quantifying the uncertainty of parameter estimates (Weeks 4-10).
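
As a rough illustration of how these four terms relate, the following Python sketch simulates a population, computes its (normally unknown) parameter, draws a sample, and computes the matching statistic. The population of heights, the sample size of 50, and all variable names are assumptions made up for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (made-up values).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: the population mean, which is usually unknown in practice.
parameter = statistics.mean(population)

# Sample: an observed subset of 50 units, drawn without replacement.
sample = random.sample(population, 50)

# Statistic: the sample mean, used to estimate the parameter.
statistic = statistics.mean(sample)

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

The statistic will generally be close to, but not equal to, the parameter; statistical inference is about quantifying that gap.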

Observational Study

A study that observes and collects information on units but does not attempt to change or influence the units.

Confounding occurs in an observational study when a third variable is associated with both the explanatory variable and the response, making it appear that the outcome of one variable is “causing” the outcome of the other.

OpenIntro: Guided Practice 1.12 (pg. 25)

Bias

The tendency to systematically favor certain parts of a population over others.

How can we reduce biases when designing an observational study?

Use a random mechanism when sampling from the population.

Simple random sampling
Stratified random sampling
Cluster random sampling
Systematic random sampling
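
The four schemes listed above can be sketched in Python roughly as follows. The population of 120 labelled units in 6 equal groups, the function names, and the sample sizes are all hypothetical choices for illustration, and each function is only one simple way to carry out its scheme.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: 120 units labelled 0..119, in 6 groups of 20.
units = list(range(120))
group_of = {u: u // 20 for u in units}
groups = {g: [u for u in units if group_of[u] == g] for g in range(6)}

def simple_random(n):
    # Every unit has the same chance of selection.
    return random.sample(units, n)

def stratified(per_group):
    # Take a simple random sample from within every group (stratum).
    return [u for members in groups.values()
            for u in random.sample(members, per_group)]

def cluster(n_groups):
    # Randomly pick whole groups (clusters) and keep all of their units.
    chosen = random.sample(sorted(groups), n_groups)
    return [u for g in chosen for u in groups[g]]

def systematic(step):
    # Random starting point, then every `step`-th unit in the list.
    start = random.randrange(step)
    return units[start::step]

print(len(simple_random(30)), len(stratified(5)),
      len(cluster(2)), len(systematic(4)))
```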

Designed Experiments

A study in which the observed units are randomly assigned to treatments.

Example:

Four principles of a well-designed randomized experiment:

Controlling
Randomization
Replication
Blocking
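
One way these principles can fit together is a randomized block design, sketched below in Python. The 40 participants, the "high"/"low" blocking variable, and the function name `assign_within_blocks` are hypothetical; this is a sketch of the general idea, not a prescribed procedure.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical units: 40 participants, each tagged with a blocking variable.
participants = [{"id": i, "block": "high" if i < 16 else "low"}
                for i in range(40)]

def assign_within_blocks(people, treatments=("treatment", "control")):
    # Blocking: group participants by the blocking variable first.
    blocks = {}
    for p in people:
        blocks.setdefault(p["block"], []).append(p["id"])
    assignment = {}
    for ids in blocks.values():
        # Randomization: shuffle the order within each block.
        random.shuffle(ids)
        # Controlling: alternate so each block has a comparison (control) group.
        for k, pid in enumerate(ids):
            assignment[pid] = treatments[k % len(treatments)]
    return assignment

assignment = assign_within_blocks(participants)

# Replication: many participants receive each treatment within each block.
for block in ("high", "low"):
    ids = [p["id"] for p in participants if p["block"] == block]
    n_treated = sum(assignment[i] == "treatment" for i in ids)
    print(block, "treatment:", n_treated, "control:", len(ids) - n_treated)
```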

Design an Observational Study: Part 1

A very large college class has 600 students. The students are divided into 25 groups, each of 24 students, for lab sections administered by different teaching assistants. The instructor wants to conduct a survey about how satisfied the students are with the course, and she believes that the lab section a student is in might affect the student’s overall satisfaction with the course.

Using one of the sampling schemes discussed in this week’s assigned reading, in a few sentences, propose a strategy to sample 100 students from the class so that you have a representative sample of the entire population of interest.

Sampling Schemes:

Simple random sampling
Stratified random sampling
Cluster random sampling
Systematic random sampling

Poll Everywhere Question - Earn Participation Points!

Please answer the question currently open at

PollEv.com/erinhowardstats

Design an Observational Study: Part 2

Find one or two other students nearby to do this part with.
Introduce yourself if you have not already done so.
One group member should start by reading their study design aloud to the group.
The other group members’ task is to determine what sampling scheme was used.
Make sure each group member has a chance to share their sampling designs.

Design an Experiment: Part 1

A pharmaceutical company is interested in assessing whether taking daily aspirin reduces the risk of heart attack. 1,500 individuals over the age of 55 have agreed to participate in the company’s study. Of the 1,500 participants, 550 report being at-risk for heart disease based on family medical history. The remaining 950 participants report no predisposition to heart disease.

In a few sentences, briefly outline an experimental design that may allow the researchers to answer the question of interest: “Does taking daily aspirin reduce the risk of heart attack?”

Your design should include the four elements of experimental design: controlling, randomization, replication, and blocking.

Design an Experiment: Part 2

Return to your small group
In your group, take turns sharing the experimental designs.
After each member shares, identify how controlling, randomization, and replication were implemented in the study.
- If one or more of these components is missing from the suggested design, discuss how it could be implemented.

Exploring Data

Describing Distributions of Quantitative Data

Center: Central tendency of the data

\[ \text{Sample mean: } \overline{x} = \frac{\sum \limits_{i = 1}^n x_i}{n}\]

\(n\): number of observations
\(x_i\): \(i\)th observation in the dataset
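
As a small worked example, the sample mean formula can be computed directly in Python. The data values are made up for illustration.

```python
# Made-up observations x_1, ..., x_n.
data = [4, 8, 15, 16, 23, 42]

n = len(data)          # number of observations
x_bar = sum(data) / n  # sum of the x_i divided by n

print(x_bar)  # 18.0
```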

Sample Median, \(M\): middle value of the ordered data

If \(n\) is odd, \(M\) is the middle value in the ordered set of values.
If \(n\) is even, \(M\) is the midpoint (average) of the \(\frac{n}{2}\)th and \(\left(\frac{n}{2}+1\right)\)th observations.
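
The odd and even cases can be sketched in Python as below. The function name `sample_median` and the data are assumptions for illustration. Python lists are indexed from 0, so the \(\frac{n}{2}\)th ordered observation sits at index `n // 2 - 1`.

```python
def sample_median(values):
    ordered = sorted(values)  # the median is defined on the ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th ordered observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([7, 1, 5]))     # 5
print(sample_median([7, 1, 5, 3]))  # 4.0
```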

Spread: how spread out is the distribution?

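Variance: the average squared deviation of observations from the mean, dividing by \(n-1\)

\[s^2 = \frac{\sum \limits_{i=1}^n(x_i-\overline{x})^2}{n-1}\]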

Standard deviation: the typical deviation of observations from the mean

\[s = \sqrt{s^2} = \sqrt{\frac{\sum \limits_{i=1}^n(x_i-\overline{x})^2}{n-1}}\]

Standard deviation is often used instead of variance to describe the spread of a distribution since it is expressed in the same units as the variable of interest.
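
As a worked example, the sample variance \(s^2\) and standard deviation \(s\) can be computed step by step in Python. The data are made up, and the result is compared against the standard library's `statistics.stdev`, which also divides by \(n-1\).

```python
import math
import statistics

# Made-up observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# The standard deviation is the square root of the variance.
s = math.sqrt(s2)

print(round(s2, 3), round(s, 3))
print(math.isclose(s, statistics.stdev(data)))  # True
```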

Interquartile Range (IQR): describes the range of the middle 50% of the data

\[IQR = Q_3 - Q_1\]

\(Q_1\) is the first quartile: 25th percentile, the value such that 25% of data fall below this value

\(Q_3\) is the third quartile: 75th percentile, the value such that 75% of data fall below this value
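
A minimal Python sketch of the IQR calculation follows, using made-up data. Note that textbooks and software use several slightly different conventions for computing quartiles, so values can differ a little between tools; the `method="inclusive"` option of `statistics.quantiles` used here is just one such convention.

```python
import statistics

# Made-up observations.
data = [2, 4, 5, 7, 8, 10, 12, 15, 21]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1

print(q1, q3, iqr)  # 5.0 12.0 7.0
```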

Outliers: extreme values that fall outside the pattern of the data

An observation is considered an outlier if it is

less than \(Q_1 - 1.5\times IQR\)
greater than \(Q_3 + 1.5 \times IQR\)
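
The \(1.5 \times IQR\) rule can be sketched in Python as follows. The data are made up, and the quartiles use the `method="inclusive"` convention of `statistics.quantiles`, which is one of several conventions and an assumption of this example.

```python
import statistics

# Made-up observations with one extreme value.
data = [2, 4, 5, 7, 8, 10, 12, 15, 40]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag observations that fall outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # -5.5 22.5 [40]
```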

Shape: Overall pattern of the data

Is the distribution symmetric, left-skewed, or right-skewed?
How many peaks does the distribution have? Unimodal, bimodal, or multimodal?

Histogram

Boxplot

Proportions & Percentages

The table displays the number of Nobel laureates in physics, chemistry, medicine, and economics per country from 1969 to 2020.

Country	Count	Proportion	Percentage
France	15	15/442 = 0.034	3.4%
Germany	20	20/442 = 0.045	4.5%
Japan	15	15/442 = 0.034	3.4%
Sweden	8	8/442 = 0.018	1.8%
Switzerland	15	15/442 = 0.034	3.4%
United Kingdom	45	45/442 = 0.102	10.2%
United States	281	281/442 = 0.636	63.6%
Other	43	43/442 = 0.097	9.7%
Total	442	1.000	100.0%

Source: https://www.statista.com/chart/19646/science-nobel-prizes-by-country-and-immigrant-share/
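
The proportion and percentage columns can be reproduced from the counts with a short Python sketch. The counts are copied from the table above; the variable names are assumptions for illustration.

```python
# Counts copied from the table above.
counts = {
    "France": 15, "Germany": 20, "Japan": 15, "Sweden": 8,
    "Switzerland": 15, "United Kingdom": 45, "United States": 281,
    "Other": 43,
}

total = sum(counts.values())  # 442

for country, count in counts.items():
    proportion = count / total  # proportion = count / total
    # A percentage is the proportion multiplied by 100.
    print(f"{country:15s} {count:4d} {proportion:.3f} {proportion:.1%}")
```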

Barplots

R Programming Demo