We have constructed 95% confidence intervals consisting of the most plausible values of a parameter (such as what we’ve been calling a “pMean”, the mean of some variable in a population or process) based on a data set (a “sample” from a population, or a “snapshot” of a process) using the following facts:

- The sampling distribution of a mean is approximately bell-shaped.
- For most representative datasets, the sMean lands within about 2 standard errors (standard deviations of the sampling distribution) of the pMean.
But, since we do not have access to the population and can’t gather thousands of datasets from it (if we did and could, we wouldn’t need to estimate anything!), how do we know what the standard deviation of the sampling distribution (the standard error) actually is?
Somehow, we need to be able to estimate the standard error using only the single sample we have.
One approach, called Bootstrap Resampling, is as follows: treat our dataset as a “proxy population” standing in for the real one, and simulate the process of collecting many datasets by drawing cases from it at random.
One obvious difference between our dataset and the population, however, is that our sample is usually a lot smaller: maybe we have 100 cases, whereas the population has a million (or if it is a process, maybe it has unlimited potential cases it can generate).
But this is no problem; we can just treat each case in our dataset as a stand-in for a whole lot of cases in the population similar to it.
This means that when gathering cases randomly from the “proxy population” (i.e. the dataset) to form our “simulated datasets” (bootstrap samples), we will never actually “use up” any of the data.
That is, we allow ourselves to potentially draw the same case multiple times, even within a single simulated dataset. We call this sampling with replacement.
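As a tiny illustration of the difference (a made-up four-case vector, not part of the lab data; resample() comes from the mosaic package we use throughout):

library(mosaic)
set.seed(1)                   # any seed; just so the output is reproducible
toy <- c("A", "B", "C", "D")  # a hypothetical four-case "proxy population"
resample(toy, size = 4)       # with replacement: some cases may repeat, others may be skipped
sample(toy, size = 4)         # without replacement: every case appears exactly once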
We’ll use the AmesAll house dataset again as our population. Remember that in a real application we would not have access to this; we are doing this now for the sake of seeing how confidence intervals behave.
AmesAll <- read.file("http://colindawson.net/data/ames.csv")
First let’s simulate collecting a dataset as we did in the last lab. Our dataset will consist of 50 houses randomly selected from all houses sold in Ames between 2006 and 2010.
set.seed(00029747) # Replace this with your choice of number (unique to you)
HomeData <- sample(AmesAll, size = 50)
As a reminder, here’s what we did in the last lab to simulate repeatedly sampling 50 houses from the population, and computing the mean living area from each set of 50 houses.
SamplingDistribution_MeanArea_n50_s5000 <-  # Assigning a name to our results
  do(5000) *                        # Repeat the following 5000 times (5000 is arbitrary)
    sample(AmesAll, size = 50) %>%  # Each time, take 50 random houses from the population
    mean(~Area, data = .)           # For each set of 50 houses (represented by the .)
                                    # get the mean area, and put it in the sampling dist.
But it’s “cheating” to use the full AmesAll dataset in this way, since it represents the entire population, which in reality we would not have access to. You should verify that in the remaining steps we never actually use the full AmesAll dataset; we only ever use our HomeData dataset, which plays the role of the dataset in our actual study.
Here’s what we’ll do instead. Note the parallel structure to what we did before. (Replace the seed before running it)
set.seed(8675309) # To aid in reproducibility
Bootstrap_MeanArea_s5000 <-           # Putting our resulting means in a labeled container
  do(5000) *                          # Repeats what follows many times (5000 is an
                                      # arbitrary "big number")
    resample(HomeData, size = 50) %>% # Each time take 50 random houses with replacement
                                      # from the _dataset_. The value 50 is _not_ arbitrary.
    mean(~Area, data = .)             # For each set of 50 houses, compute the mean area
Note: While the 5000 is an arbitrary “large number”, chosen to obtain a reasonably smooth bell-curve distribution, the 50 is not arbitrary. We choose 50 because that is the sample size of our dataset, and we are interested in the behavior of means computed from datasets of 50 houses, not datasets with some other number of houses.
Exercise 2: Apart from the names we assigned to things (Bootstrap_MeanArea instead of SamplingDistribution_MeanArea), there are only two differences between this procedure and the last one. What are they?

Spoilers for Exercise 2 follow!
By using resample() we are specifying that between each house we draw from our pool (HomeData in this case), we want to “replace” the last house we drew; that is, we want to allow ourselves to draw it again.
This is important, since the pool now only contains 50 houses; so if we didn’t replace houses between draws, drawing 50 houses from the pool would just be drawing every house once. And then every simulated dataset would be the same.
But they’re not: sometimes we get, say, 123 Elm St twice and 300 E College St not at all; another time it might be the opposite. As a result, each simulated dataset of 50 houses is different, and therefore has a different mean area.
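We can verify this directly on a single bootstrap sample of the Area values (the seed is arbitrary, chosen only for illustration):

set.seed(42)                                      # arbitrary seed, for illustration only
boot_areas <- resample(HomeData$Area, size = 50)  # 50 area values drawn with replacement
sum(duplicated(boot_areas))                       # typically > 0, since draws can repeat
mean(boot_areas)                                  # the bMean from this one simulated dataset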
Let’s plot the resulting collection of means to see:
gf_histogram(
  ~result,
  data = Bootstrap_MeanArea_s5000,
  binwidth = 5,  ## usually need to tweak this for each histogram
  xlab = "Bootstrap Mean (bMean)",
  ylab = "Number of Bootstrap Datasets (out of 5000)")
Compare this to the “true” sampling distribution we computed earlier (but couldn’t actually obtain in reality):
gf_histogram(
  ~result,
  data = SamplingDistribution_MeanArea_n50_s5000,
  binwidth = 5,
  xlab = "Sample Mean (sMean)",
  ylab = "Number of Simulated Datasets (out of 5000)")
Not identical, but pretty similar, right?
Exercise: Compute each of the following means:

- the pMean of Area in the full AmesAll data set (what we want to estimate, since we wouldn’t have access to it in reality)
- the sMean of Area in our HomeData dataset (what we’d start with in reality)
- the mean of the sMeans in the true sampling distribution
- the mean of the bMeans in the bootstrap distribution

What do you notice?
Code Hint: Remember that the command to find a mean looks like mean(~VariableName, data = DatasetName), where VariableName is the name of a column in a dataset and DatasetName is the object in your environment that contains the data.
## Replace the 0s with commands that return the quantities of interest
pMean_Area <- mean(~Area, data = AmesAll)
sMean_Area <- 0
mean_of_sMeans <- 0
mean_of_bMeans <- 0
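If you get stuck, one possible completion (spoiler!), using the containers defined earlier, is:

pMean_Area     <- mean(~Area, data = AmesAll)
sMean_Area     <- mean(~Area, data = HomeData)
mean_of_sMeans <- mean(~result, data = SamplingDistribution_MeanArea_n50_s5000)
mean_of_bMeans <- mean(~result, data = Bootstrap_MeanArea_s5000)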
(If you store the corresponding means in objects with the names used below, then the values will display in the Knitted document automatically)
The pMean is 1499.6904437. My dataset mean is 0.
The mean of means from the true sampling distribution is 0.
The mean of means from the bootstrap distribution is 0.
The mean of means from the true sampling distribution is about the same as ______
The mean of means from the bootstrap distribution is about the same as _______
Exercise: To see why replacement matters, try using sample() instead of resample() in the procedure where we got our bootstrap distribution. Save the results in a container called WrongBootstrap, and then plot the resulting means using the histogram command provided (I’ve added a piece for you that sets the range on the x axis to be about the same as for the previous histogram, for easier comparison). A sketch of the setup appears below.
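Here is a minimal sketch of that experiment (the seed and the x-axis limits below are placeholders, not the exact command provided in the lab file):

set.seed(12345)                      # placeholder seed; use your own
WrongBootstrap <-
  do(5000) *
    sample(HomeData, size = 50) %>%  # no replacement: each "simulated dataset" is all of HomeData
    mean(~Area, data = .)

gf_histogram(
  ~result,
  data = WrongBootstrap,
  binwidth = 5) %>%
  gf_lims(x = c(1300, 1750))         # placeholder limits, roughly matching the earlier histograms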
When we simulated a sampling distribution using the full population as our pool from which the houses in our simulated datasets were drawn, the standard error was the standard deviation of the resulting collection of means (one from each set of 50 houses).
## Get the standard deviation of the collection of means (couldn't actually get in reality)
trueStandardError <- sd(~result, data = SamplingDistribution_MeanArea_n50_s5000)
trueStandardError_rounded <- round(trueStandardError, digits = 1)
The true standard error is 71.4.
Once again, though (it bears repeating), we can’t actually calculate the standard error like this, since the SamplingDistribution was constructed using the full population, which we won’t have access to in practice.
Instead, we can calculate an estimated standard error using the bootstrap distribution. It won’t be exactly the same as the “true” standard error, but it should be close.
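For example, using the Bootstrap_MeanArea_s5000 container from above (your value will differ slightly depending on your seed):

## Estimated standard error: the SD of the bootstrap means (this IS available in reality)
est_SE_MeanArea <- sd(~result, data = Bootstrap_MeanArea_s5000)
round(est_SE_MeanArea, digits = 1)  ## compare to the "true" standard error above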
Now that we have a reasonable estimate of the standard error, we can construct a confidence interval.
Making use of the fact that the population mean is within 2 standard errors of the sMean of most representative datasets, we will take our sMean along with the estimated standard error from our bootstrap distribution, and go two standard errors below the sample mean, and two standard errors above the sample mean. We can be 95% confident that the population mean lies between those values.
Note that the center of the interval (the sMean) comes from the original dataset (HomeData) directly. Only the estimated standard error comes from the bootstrap distribution.

The following is a recap of the process and the code we used to construct a bootstrap confidence interval.
First, compute the statistic from the data (here, the mean Area).
sMean_Area <- mean(~Area, data = HomeData)
Next, set the random seed and create the bootstrap distribution.
set.seed(5551212)
Bootstrap_MeanArea_s5000 <-
  do(5000) *
    resample(HomeData, size = 50) %>%
    mean(~Area, data = .)
Then, calculate the estimated standard error.
est_SE_MeanArea <- sd(~result, data = Bootstrap_MeanArea_s5000)
The lower and upper endpoints of the interval are the sample statistic plus and minus two standard errors.
CI_lower_meanArea <- sMean_Area - 2 * est_SE_MeanArea
CI_upper_meanArea <- sMean_Area + 2 * est_SE_MeanArea
## rounding, for reporting inline
sMean_Area_rounded <- round(sMean_Area, digits = 1)
CI_lower_rounded <- round(CI_lower_meanArea, digits = 1)
CI_upper_rounded <- round(CI_upper_meanArea, digits = 1)
Based on our dataset consisting of 50 randomly sampled homes, where the mean area was 1509.1, we are 95% confident that the mean area of houses sold in Ames between 2006 and 2010 is between 1374.8 and 1643.4.
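In the R Markdown source, a sentence like the one above can be generated with inline code; a hypothetical snippet, assuming the rounded objects defined earlier:

Based on our dataset, where the mean area was `r sMean_Area_rounded`, we are 95% confident that the mean area is between `r CI_lower_rounded` and `r CI_upper_rounded`.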
Exercise 12: Construct a 95% bootstrap confidence interval for the correlation between living area (Area) and sale price (Price) in Ames, using your dataset of 50 houses. Compute the interval for the correlation between Area and Price as the sCorrelation plus and minus two standard errors of the correlation. (You may need to refer back to Lab 5 to remind yourself how to compute a correlation in R; essentially everything is the same as for a mean, except that instead of calculating a mean of each bootstrap sample, you will calculate a correlation.) Do you notice anything different about the shape of the bootstrap distribution of correlations, compared to the distribution of means? (Hint: Plot it!) Does this raise any concerns about the formula we are using for the interval? Explain. A sketch of the bootstrap code appears after the rubric below.

SLO D3 (Exercise 12):
+1: The bootstrap code uses resample() to sample with replacement
+1: The bootstrap code samples from the HomeData sample (as opposed to sampling from the AmesAll population)
+1: The correlation is computed for each bootstrap sample
+1: The standard error is computed as the standard deviation of the results in the bootstrap distribution
+1: The confidence interval is centered on the correlation in the HomeData sample
+1: The margin of error is 2 standard errors
+1: The interval is interpreted as being about the correlation between price and area
+1: The interval is interpreted as being about the population of all houses in Ames