Attribution

This lab was modified by Colin Dawson from source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department, which accompanies the OpenIntro statistics textbooks. This handout, like the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.

Overview and Learning Goals

In this lab we will use simulation to construct sampling distributions using a large dataset as a “stand-in” for an actual Population or Data-Generating Process.

The learning goals are:

  • To develop intuitions about the concepts of statistics, parameters, sampling distributions, and sampling error (the discrepancy between a statistic and the parameter it is meant to estimate), and how these relate to each other.
  • To see how R can be used as a tool for conducting simulations involving many repetitions of the same process. Although the simulations we will do today are in the realm of “thought experiment” to understand concepts, rather than actual applied statistical analysis, many of the code structures will be used for actual analyses with small variations.

Some Important Definitions

Some of the terms in the overview paragraph were only recently defined in class. As a reminder, here are their definitions.

Definition: A population (or data-generating process) consists of all potential cases of interest. Data-generating processes have parameters (like a mean, median, standard deviation, correlation, etc.) that summarize some property of the process (notice that all the words that start with ‘p’ go together!).

Definition: Datasets (or samples) are snapshots produced by these data-generating processes (possibly via collecting a sample from a population), and can be summarized with statistics (mean, median, standard deviation, correlation, etc.). (Most of these are ‘s’ words!) Each dataset produced by a process is different, even if nothing about the process has changed, as are their statistics.

Definition: Statistics summarize characteristics of datasets/samples, and are used to estimate the value of the corresponding parameter from the underlying process/population. When the data-collection method is unbiased, the potential for overestimation is balanced by the potential for underestimation.

Definition: To understand how good our estimates are, we want to investigate the sampling distribution of our statistic, which consists of a collection of values of that statistic, each one calculated using a different dataset produced by the process. Sampling distributions are “hypothetical” rather than concrete, in that we will never actually have direct access to one in real life; but we can still study their hypothetical properties, and estimate what they would look like.
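As a tiny illustration of the idea (a sketch using a made-up Normal process rather than any real data; the do() function used here is introduced properly later in the lab):

```r
library(mosaic)  # provides do()

## Each repetition simulates a fresh dataset of 20 values from a known
## process, then records that dataset's mean. The five recorded means,
## taken together, form a (very small) sampling distribution of the mean.
do(5) * mean(rnorm(20, mean = 100, sd = 15))
```

Each run produces five slightly different means, even though nothing about the underlying process changed between datasets.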

The Stand-In for the Population/Process

In reality we almost never have access to the underlying process or the full population. But for this lab, to make things concrete, we will use a large dataset as a stand-in for a population.

The data consists of thousands of real estate transactions from the city of Ames, Iowa, as recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection is a stand-in for our population of interest.

First, let’s load the mosaic package and the data.

library(mosaic) # Loading a package that is already loaded does no harm
AmesAll <- read.file("http://colindawson.net/data/ames.csv")

This is a reduced dataset with just two variables:

Variable   Units         Description
--------   -----------   -------------------------------------------
Area       Square Feet   Total above-ground living area of the house
Price      US Dollars    Sale price of the home

We will focus on the variable Area for now.

Sampling from the Population

Gathering information on an entire population is often extremely costly or impossible. Instead, we often take a sample (a “snapshot”) of the population and use that to understand the properties of the population.

Suppose we were interested in estimating the mean living area of houses in Ames.

We might survey a random sample of, say, 50 homes in the city and collect various information about each home. The full population contains 2930 homes, so we will have data on less than 3% of the population, but it will turn out that we can make some decent estimates about the population, provided our sample is representative.

Let’s create a dataset consisting of 50 houses randomly sampled from the population. Again, the AmesAll data frame represents our population, which we would not normally have access to. The 50 houses we sample represent an actual dataset we would have.

The sample() command takes a subset of cases from a data frame:

  • The value of the size= argument specifies how many cases will be in our dataset.
  • The result is a data frame consisting of the sampled cases.
  • We will store the result in a variable called HomeData (we could pick any name we want).

An R tip when doing random simulations

In this lab we will be doing some stochastic (random) simulations, like sampling many times from a population.

This will involve commands that take random samples from larger datasets.

Since the sample is random, it will change every time you re-run the command. This can be annoying if you are trying to describe the results in text.

To avoid this issue, you can include the following line of code at the start of a chunk that does random simulation to initialize R’s random number generator. That will cause it to give the same results when you re-run or Knit your document.

## You can pick any number here; I've used my T number
## This should go before any line that involves doing something random.
## Each person in the group should use a different number so you can get different results!
## This is important for the goal of seeing how different possible datasets differ
set.seed(00029747)
HomeData <- sample(AmesAll, size = 50)
  1. Describe the distribution of the Area variable in your HomeData dataset (you’ll want to plot them first – choose an appropriate plot for this type of variable). You can refer to Lab 4 for examples of plot commands, but remember that most of them have the form gf_typeofplot(~VariableName, data = DatasetName) for single variable plots, or gf_typeofplot(YVariable~XVariable, data = DatasetName) for plots involving two variables.

Code and Response to Exercise 1


  2. What would you say is the “typical” home size within your HomeData dataset? What would you interpret “typical” to mean in this context? (Remember that different group members have slightly different data due to the random sampling process.)

Code and Response to Exercise 2


  3. Would you expect another student’s graph and summary statistic to be identical to yours? Would you expect them to be similar? Why or why not?

Response to Exercise 3


You may have chosen to use the mean or the median as a “typical” value in the distribution of areas. Either is a reasonable choice. So that we’re all on the same page, let’s focus on the mean.

## I'm using `sMean` as a shorthand for 'sample mean', or 'dataset mean',
## or 'snapshot mean'. I've also included
## the name of the variable I'm taking the mean of, so I can remember
## what this value represents.
sMean_Area <- mean(~Area, data = HomeData)
  4. If you run the above code using the “Run Chunk” button, you should see the new sMean_Area object appear in your Environment tab. Write a sentence giving its value, rounded to one decimal place. (Optional: Use the inline reference syntax we learned in the last lab so that you can just type the variable name in your sentence and have R replace it with its value for you. To round, you can use round(OriginalValue, digits = 1). This command can go inline right in your sentence, or you can create a new variable first with the rounded value.)

Response to Exercise 4


  5. Now find the corresponding parameter (the “pMean”) from the full AmesAll dataset. How far away is your statistic from the parameter it is an estimate of? Is the sampling error (the statistic minus the parameter) positive (the statistic is an overestimate) or negative (the statistic is an underestimate)?

Response to Exercise 5


  6. Enter your sMean and the value you got for the sampling error into the Google Form here: https://forms.gle/Gf3QcshkA9k1ajWU6

Nothing to write here


Once a few people have entered their results, you can see a graph of everyone’s values by running the following code to download and import everyone’s data from the Google Sheet. You might get a warning about an “incomplete final line”, but it should still make the plot.

(Don’t worry about the details of this code chunk unless you’re really interested; it’s not here because I expect you to learn how to do it; just to get the plot to look at)

url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQLgaH2TUTjZmEjaMJ3eJlikW3QtmgzMH08rHaUdTxOCa8ZpIKF0mko-lIWFscWhLRdgtsvUwrMQM6B/pub?gid=1286280604&single=true&output=csv"
download.file(url, destfile = "class_simulation.csv")
smallSamplingDist <- read.file("class_simulation.csv")
## Replace my name with yours in fill = ~(Name == "Colin") below. This
## will color your value a different color than the others.
pMean <- mean(~Area, data = AmesAll) %>% round(digits = 1)
gf_histogram(
    ~sMean, 
    data = smallSamplingDist, 
    fill = ~(Name == "Colin"), 
    binwidth = 10,
    xlab = "Mean Area of 50 Houses",
    ylab = "Number of Data Sets") +
  geom_vline(xintercept = pMean) +
  annotate("label", x = pMean, y = 2, label = paste("pMean =", pMean))

Depending on which 50 homes your dataset happens to contain, your estimate of the mean living area in Ames will either be a bit above or a bit below the mean living area of all homes in Ames.

In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

Scaling Up the Simulation

In this lab, because our stand-in for the population consists of actual data, we can see how far the mean of 50 randomly selected houses tends to fall from the target parameter, simply by repeating the above steps many times.

Computing a sample mean is easy: just get the sample, and compute the mean, as you did above.
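Combining those two steps, one replication of the data-collection process can be written as a single line (a sketch using the same AmesAll stand-in and sample size as above):

```r
## Sample 50 houses from the stand-in population, then compute the mean
## Area of just those 50 houses: one simulated replication.
oneMean <- mean(~Area, data = sample(AmesAll, size = 50))
```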

With the speed of modern computers, it is easy to simulate sampling many times from a population.

For example, we can simulate collecting 10 different datasets, each consisting of 50 houses, and look at how the statistic (the sMean) varies across these 10 datasets:

The following code will collect 10 datasets of 50 houses each, simulating replicating the data-collection process 10 times.

For each of the 10 datasets we will compute the mean Area, and then store the collection of 10 means in a new data frame object.

SamplingDistribution_n50_s10 <- do(10) *  # A cute shorthand for Do <X> 10 times
  sample(AmesAll, size = 50) %>%          # This line collects a dataset of 50 and passes it on
    mean(~Area, data = .)                 # This line computes the mean Area in that dataset
## After all is said and done, do(10) * sample(...) %>% mean(...) returns a data frame
## consisting of s = 10 mean areas, each from a different subset of n = 50 houses. The resulting
## data frame is stored in a container called SamplingDistribution_n50_s10

R and mosaic note

The do() function in mosaic allows you to repeat some code a specified number of times, and store the results of each iteration in a data frame.
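For example (a minimal sketch unrelated to the Ames data), mosaic’s rflip() simulates flipping a fair coin, and do() repeats the simulation, stacking the results into a data frame:

```r
library(mosaic)

## Flip 10 coins, three separate times; each row of the resulting data
## frame records the outcome (number of heads, etc.) of one set of flips.
do(3) * rflip(10)
```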

The “pipe” (%>%) operator passes the data frame produced by the sample() function on to the mean() function as the data argument, where it is represented by the . (you should leave this as a literal period).
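To make the pipe concrete, the following expressions (a sketch unrelated to the Ames data) all compute the same number:

```r
## Without the pipe: the result of sqrt(50) is the first argument to round()
round(sqrt(50), digits = 1)
## With the pipe: the left-hand result becomes the first argument...
sqrt(50) %>% round(digits = 1)
## ...or goes wherever the `.` placeholder appears
sqrt(50) %>% round(x = ., digits = 1)
```

All three return 7.1. The %>% operator becomes available once the mosaic package is loaded.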

Each time we iterate, we are doing the same thing: take a random subset of 50 houses in Ames, and compute the mean Area of the 50 properties in the sample.

Since each subset is random, we wind up with 10 different subsets of size 50, which may overlap with each other, and 10 different means, which we store in a data frame called SamplingDistribution_n50_s10. The data frame has just one column, called result, which contains the mean of each sample.
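One quick way to take a look (your values will differ from anyone else’s):

```r
head(SamplingDistribution_n50_s10)  # first few rows: one column, named `result`
nrow(SamplingDistribution_n50_s10)  # 10 -- one row per simulated dataset
```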

  7. Examine the SamplingDistribution_n50_s10 data frame. Verify that the column is called result, and the values look similar to the sample means from the Google Form.

Nothing to write


Let’s plot the resulting distribution of means with a histogram.

gf_histogram(
    ~result, 
    data = SamplingDistribution_n50_s10, 
    binwidth = 10,
    xlab = "Mean Area of 50 Houses",
    ylab = "Number of Datasets") +
  ## The next two lines just add the vertical line at the pMean, and 
  ## then add the text label
  geom_vline(xintercept = pMean) +
  annotate("label", x = pMean, y = 2, label = paste("pMean =", pMean))

  8. Compute the “mean of means” from your sampling distribution of 10 sMeans. Is it close to the actual parameter (i.e., the “pMean”, which represents the mean area across all houses in Ames)? (Hint: Remember that the variable in SamplingDistribution_n50_s10 is called result)

Code and Response to Exercise 8


HOMEWORK

  9. Scale up the simulation to produce sampling distributions of means using 20 datasets, 100 datasets, and 5000 datasets (keep the size of each dataset at n = 50). What do you notice about the shape of the sampling distribution as you add more and more sMeans to it?

Code and Response to Exercise 9


  10. Using your sampling distribution of 5000 sMeans, find a “typical” distance between each sMean and the pMean. (Hint 1: Note that the pMean is also the mean of the sMeans. Hint 2: We’ve learned about a summary statistic that tells us how far individual data points tend to be from their collective mean. You might want to refer back to “Summarizing a Quantitative Variable” from Lab 4 for code to compute various measures, or look up the function name on the “Most Frequently Used R Commands” cheatsheet on the Labs tab on the course website.)

Code and Response to Exercise 10


  11. Approximately what percentage of sMeans would you expect to be within this “typical” distance of the pMean? (Hint: Use your answer to Exercise 9 and think about what we know about distributions with that shape.)

Code and Response to Exercise 11


  12. Create a sampling distribution of 5000 sMeans, but this time include 200 houses in each dataset (that is, change the sample size, n, to 200).

Code and Response to Exercise 12


  13. Plot the resulting set of 5000 sMeans, and compute the “typical” distance between sMean and pMean. With 200 houses in each dataset, how does the expected precision of an estimate from a single dataset change?

Code and Response to Exercise 13


  14. A dataset with just n cases is an imperfect reflection of the data-generating process, so even when nothing about the data-collection method is biased, the sMean will not be exactly equal to the pMean we are trying to estimate. How is the typical size of this sampling error affected by the value of n?

Response to Exercise 14
