Topic 1 - Collecting & Exploring Data

Download or print notes to PDF

If you’d like to export this presentation to a PDF, do the following:

  1. Toggle into Print View using the E key.
  2. Open the in-browser print dialog (CTRL/CMD+P).
  3. Change the Destination to Save as PDF.
  4. Change the Layout to Landscape.
  5. Change the Margins to None.
  6. Enable the Background graphics option.
  7. Click Save.

This feature has been confirmed to work in Google Chrome and Firefox.

Collecting Data

Population

The complete collection of subjects we are interested in learning or making inference about.

Example:

 

Parameter

A characteristic about the population, typically unknown or unobservable.

Example:

Sample

An observed subset of the population.

Example:

 

Statistic

A characteristic about the observed sample.

Example:

Statistical Inference

  • The process of using known sampled information to form a conclusion about unknown population characteristics.

  • Primarily concerned with understanding and quantifying the uncertainty of parameter estimates (Weeks 4-10).

Observational Study

A study that observes and collects information on units but does not attempt to change or influence the units.

  • Confounding occurs in an observational study when a third variable, associated with both the explanatory variable and the response, makes it appear that the outcome of one variable is “causing” the outcome of another.

 

OpenIntro: Guided Practice 1.12 (pg. 25)

Bias

The tendency to systematically favor certain parts of a population over others.

How can we reduce biases when designing an observational study?

Use a random mechanism when sampling from the population.

  • Simple random sampling
  • Stratified random sampling
  • Cluster random sampling
  • Systematic random sampling
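
The four schemes can be compared side by side in a short Python sketch. The population here is hypothetical (120 units in 6 groups of 20), and the function names are illustrative, not from any course material.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 120 unit IDs, each in one of 6 groups of 20.
population = [(unit_id, unit_id // 20) for unit_id in range(120)]
groups = sorted({g for _, g in population})

def simple_random_sample(pop, n):
    """Every subset of size n is equally likely to be chosen."""
    return random.sample(pop, n)

def stratified_sample(pop, per_group):
    """Take a simple random sample of per_group units within every group (stratum)."""
    sample = []
    for g in groups:
        stratum = [u for u in pop if u[1] == g]
        sample.extend(random.sample(stratum, per_group))
    return sample

def cluster_sample(pop, n_clusters):
    """Randomly choose whole groups (clusters) and keep every unit in them."""
    chosen = set(random.sample(groups, n_clusters))
    return [u for u in pop if u[1] in chosen]

def systematic_sample(pop, step):
    """Choose a random starting point, then take every step-th unit in list order."""
    start = random.randrange(step)
    return pop[start::step]

srs = simple_random_sample(population, 12)
strat = stratified_sample(population, 2)   # 2 per group x 6 groups = 12 units
clus = cluster_sample(population, 2)       # 2 groups x 20 units = 40 units
syst = systematic_sample(population, 10)   # every 10th unit = 12 units

print(len(srs), len(strat), len(clus), len(syst))
```

Note that the stratified sample is guaranteed to include every group, while the cluster sample includes only the groups that were drawn.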

Designed Experiments

A study in which the observed units are randomly assigned to treatments.

Example:

 

Four principles of a well-designed randomized experiment:

  1. Controlling
  2. Randomization
  3. Replication
  4. Blocking
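
As a rough illustration of how the four principles fit together, the sketch below randomly assigns units to a treatment or control group separately within each block. The units, block labels, and function name are hypothetical.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical units: (unit_id, block). The block is a variable suspected
# to affect the response, here labeled "high" or "low".
units = [(i, "high" if i < 8 else "low") for i in range(20)]

def blocked_assignment(units, treatments=("treatment", "control")):
    """Within each block, shuffle the units and deal them out evenly across treatments."""
    assignment = {}
    blocks = sorted({b for _, b in units})
    for block in blocks:                      # blocking: handle each block separately
        members = [uid for uid, b in units if b == block]
        random.shuffle(members)               # randomization: chance decides the groups
        for position, uid in enumerate(members):
            assignment[uid] = treatments[position % len(treatments)]
    return assignment

assignment = blocked_assignment(units)

# Controlling: a control group exists for comparison.
# Replication: each treatment is applied to several units in every block.
for block in ("high", "low"):
    treated = sum(1 for uid, b in units if b == block and assignment[uid] == "treatment")
    control = sum(1 for uid, b in units if b == block and assignment[uid] == "control")
    print(block, treated, control)
```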

Design an Observational Study: Part 1

A very large college class has 600 students. The students are divided into 25 groups, each of 24 students, for lab sections administered by different teaching assistants. The instructor wants to conduct a survey about how satisfied the students are with the course, and she believes that the lab section a student is in might affect the student’s overall satisfaction with the course.

Using one of the sampling schemes discussed in this week’s assigned reading, in a few sentences, propose a strategy to sample 100 students from the class so that you have a representative sample of the entire population of interest.

Sampling Schemes:

  • Simple random sampling
  • Stratified random sampling
  • Cluster random sampling
  • Systematic random sampling

Poll Everywhere Question - Earn Participation Points!

Please answer the question currently open at

PollEv.com/erinhowardstats

Design an Observational Study: Part 2

  • Find one or two other students nearby to do this part with.

  • Introduce yourself if you have not already done so.

  • One group member should start by reading their study design aloud to the group.

  • The other group members’ task is to determine what sampling scheme was used.

  • Make sure each group member has a chance to share their sampling designs.


Design an Experiment: Part 1

A pharmaceutical company is interested in assessing whether taking daily aspirin reduces the risk of heart attack. 1,500 individuals over the age of 55 have agreed to participate in the company’s study. Of the 1,500 participants, 550 report being at-risk for heart disease based on family medical history. The remaining 950 participants report no predisposition to heart disease.

In a few sentences, briefly outline an experimental design that may allow the researchers to answer the question of interest: “Does taking daily aspirin reduce the risk of heart attack?”

Your design should include the four elements of experimental design: controlling, randomization, replication, and blocking.


Design an Experiment: Part 2

  • Return to your small group.

  • In your group, take turns sharing the experimental designs.

  • After each member shares, identify how controlling, randomization, replication, and blocking were implemented in the study.

    • If one or more of these components is missing from the suggested design, discuss how it could be implemented.

Exploring Data

Describing Distributions of Quantitative Data

Center: Central tendency of the data

\[ \text{Sample mean: } \overline{x} = \frac{\sum \limits_{i = 1}^n x_i}{n}\]

  • \(n\): number of observations

  • \(x_i\): \(i\)th observation in the dataset
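
A minimal Python sketch of the sample mean formula, using a small hypothetical dataset:

```python
data = [2, 4, 4, 5, 10]   # hypothetical observations x_1, ..., x_n

n = len(data)             # number of observations
x_bar = sum(data) / n     # sum of the observations divided by n

print(x_bar)  # 5.0
```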

Describing Distributions of Quantitative Data

Center: Central tendency of the data

Sample Median, \(M\): middle value of the ordered data

  • If \(n\) is odd, \(M\) is the middle value in the ordered set of values.

  • If \(n\) is even, \(M\) is the midpoint (average) of the \(\frac{n}{2}\)th and \(\left(\frac{n}{2}+1\right)\)th observations in the ordered set of values.
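
The two cases can be sketched in Python as follows. The function name and the example values are hypothetical.

```python
def sample_median(values):
    """Return the middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value
        return ordered[mid]
    # n even: average the (n/2)th and (n/2 + 1)th ordered values,
    # which sit at zero-based positions mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([7, 1, 3]))      # 3
print(sample_median([7, 1, 3, 5]))   # 4.0
```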

Describing Distributions of Quantitative Data

Spread: how spread out is the distribution?

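Variance: roughly the average squared deviation of observations from the mean

\[s^2 = \frac{\sum \limits_{i=1}^n(x_i-\overline{x})^2}{n-1}\]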

Describing Distributions of Quantitative Data

Spread: how spread out is the distribution?

Standard deviation: the typical deviation of observations from the mean

\[s = \sqrt{s^2} = \sqrt{\frac{\sum \limits_{i=1}^n(x_i-\overline{x})^2}{n-1}}\]

Standard deviation is often used instead of variance to describe the spread of a distribution since it is expressed in the same units as the variable of interest.
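
A minimal Python sketch of the calculation, using a small hypothetical dataset. It computes \(s^2\) first and then takes the square root to get \(s\).

```python
import math

data = [2, 4, 4, 5, 10]   # hypothetical observations

n = len(data)
x_bar = sum(data) / n                             # sample mean: 5.0
squared_devs = [(x - x_bar) ** 2 for x in data]   # 9, 1, 1, 0, 25

s_squared = sum(squared_devs) / (n - 1)   # sample variance: 36 / 4
s = math.sqrt(s_squared)                  # sample standard deviation

print(s_squared, s)  # 9.0 3.0
```

If the data were measured in centimeters, \(s^2\) would be in squared centimeters while \(s\) would be in centimeters.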

Describing Distributions of Quantitative Data

Spread: how spread out is the distribution?

Interquartile Range (IQR): describes the range of the middle 50% of the data

\[IQR = Q_3 - Q_1\]

\(Q_1\) is the first quartile: 25th percentile, the value such that 25% of data fall below this value

\(Q_3\) is the third quartile: 75th percentile, the value such that 75% of data fall below this value
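
A short Python sketch of the IQR on hypothetical data. Software packages use slightly different conventions for computing quartiles, so results can differ a little between tools; the `method="inclusive"` option of the standard library's `statistics.quantiles` is just one such convention.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical observations

# quantiles(..., n=4) returns the three cut points Q1, median, Q3
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 3.0 7.0 4.0
```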

Describing Distributions of Quantitative Data

Outliers: extreme values that fall outside the pattern of the data

An observation is considered an outlier if it is

  • less than \(Q_1 - 1.5\times IQR\)

  • greater than \(Q_3 + 1.5 \times IQR\)
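
The rule can be sketched in Python on hypothetical data. The quartile convention (`method="inclusive"`) is an assumption; other conventions may give slightly different cutoffs.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical observations

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # values below this are flagged
upper_fence = q3 + 1.5 * iqr   # values above this are flagged

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # -3.0 13.0 [30]
```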

Describing Distributions of Quantitative Data

Shape: Overall pattern of the data

  • Is the distribution symmetric, left-skewed, or right-skewed?

  • How many peaks does the distribution have? Unimodal, bimodal, or multimodal?

Histogram

Boxplot

Proportions & Percentages

The table displays the number of Nobel laureates in physics, chemistry, medicine, and economics per country from 1969 to 2020.

Country           Count   Proportion          Percentage
France               15   15/442 = 0.034      3.4%
Germany              20   20/442 = 0.045      4.5%
Japan                15   15/442 = 0.034      3.4%
Sweden                8   8/442 = 0.018       1.8%
Switzerland          15   15/442 = 0.034      3.4%
United Kingdom       45   45/442 = 0.102      10.2%
United States       281   281/442 = 0.636     63.6%
Other                43   43/442 = 0.097      9.7%
Total               442   1.000               100.0%

Source: https://www.statista.com/chart/19646/science-nobel-prizes-by-country-and-immigrant-share/
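
The proportion and percentage columns can be reproduced from the counts with a few lines of Python. The counts are taken from the table; the variable names are illustrative.

```python
counts = {
    "France": 15,
    "Germany": 20,
    "Japan": 15,
    "Sweden": 8,
    "Switzerland": 15,
    "United Kingdom": 45,
    "United States": 281,
    "Other": 43,
}

total = sum(counts.values())   # 442

# proportion = count / total; percentage = proportion x 100
proportions = {country: count / total for country, count in counts.items()}

for country, count in counts.items():
    p = proportions[country]
    print(f"{country:<15}{count:>5}{p:>8.3f}{p:>8.1%}")
```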

Barplots

R Programming Demo