Bootstrap Resampling

In the last lab, we constructed an 95% confidence interval consisting of the most plausible values of a population mean based on a sample (data set) using the fact that

\[\mu \text{ is usually between } \bar{x} - 2SE \text{ and } \bar{x} + 2SE\]

Where do each of the quantities in this statement come from?

\(\mu\) (the population mean) is what we are trying to estimate; we don’t know its value
\(\bar{x}\) is the mean of a quantitative variable from our dataset (the sample mean)
To find \(SE\) we simulated taking many random samples from the population, computing a sample mean for each one, and taking the standard deviation of the collection of sample means

But there is a problem with using this procedure in practice: we do not have access to the population, so we can’t take repeated samples from it! (If we could we could just calculate \(\mu\); we wouldn’t need to estimate it in the first place)

Somehow, we need to be able to estimate the standard error using only the single sample we have

The idea is as follows:

Use the entire sample as an estimate of the entire population (not just the mean as an estimate of the mean)
Simulate repeated sampling from this estimated population
Compute “fake” sample statistics from each “fake sample”
The resulting collection of “fake” sample statistics form a distribution that should have a similar spread to the “real” sampling distribution we’d get by sampling repeatedly from the population itself.

One obvious difference between our sample and the population, however, is that our sample is usually a lot smaller: maybe we have 100 cases, whereas the population has a million. But this is no problem; just treat each case in our population as a stand-in for a whole lot of cases similar to it.

This means that when drawing cases randomly from the “estimated population” (i.e. the sample) to form our “fake samples”, we will never actually “use up” any of the data. That is, we allow ourselves to potentially draw the same case multiple times, even within a single “fake sample”. We call this sampling with replacement.

Imagine trying to simulate drawing marbles from a huge vat that contains 500,000 blue marbles and 500,000 red marbles. We can simulate this with just two marbles: one blue, one red. Each time we want to simulate drawing a marble, we pick randomly from the two marbles. We have a 50% chance of choosing red, and a 50% chance of choosing blue. But then we have to put back the marble we just chose before we choose again; otherwise the chances are changing too much.

The Data

We’ll use the Ames house dataset again, since it should be familiar by now.

Ames <- read.file("http://colindawson.net/data/ames.csv")

Simulating Data-Collection

First let’s simulate collecting a dataset (our sample) consisting of 50 houses randomly selected from the population, as we’ve done before. First we’ll set a random seed so our results are reproducible.

set.seed(00029747) # Replace this with your T number
Sample <- sample(Ames, size = 50)

As a reminder, here’s what we did last week to simulate repeatedly sampling 50 houses from the population, and computing the mean living area from each set of 50 houses.

SamplingDistribution <-           # Assigning a name to our results
  do(1000) *                      # Repeat the following many times (1000 is arbitrary)
    sample(Ames, size = 50) %>%   # Each time, take 50 random houses from the population
    mean(~Area, data = .)         # For each set of 50 houses, compute the mean area

But it’s cheating to use the full Ames dataset in this way, since that represents the entire population, which in reality we would not have access to.

You should verify that in the remaining steps we never actually use the full Ames dataset; only our Sample that we are assuming is the dataset in our actual study.

Here’s what we’ll do instead. Note the parallel structure to what we did before.

BootstrapDistribution <-            # Assigning a name to our results
  do(1000) *                        # Repeat the following many times (5000 is arbitrary)
    resample(Sample, size = 50) %>% # Each time take 50 random houses from the sample
    mean(~Area, data = .)           # For each set of 50 houses, compute the mean area

Besides the name we’re giving to our result (BootstrapDistribution instead of SamplingDistribution) there are only two differences between this procedure and the last one. What are they?

The first difference is that we’re drawing 50 random houses from our sample of 50 houses (Sample), instead of from the population of 2930 houses (Ames).

The second difference is that we’re using the resample() function instead of the sample() function. What does this do?

By using resample() we are specifying that between each house we draw from our pool (Sample in this case), we want to “replace” the last house we drew; that is, we want to allow ourselves to draw it again. This is important, since the pool now only contains 50 houses; so if we didn’t replace houses between draws, drawing 50 houses would just be drawing every house once. And then every sample would be the same.

But they’re not: sometimes we get, say 123 Elm St twice and 300 E College St no times; another time it might be the opposite. As a result, each “fake” sample of 50 is different, and therefore has a different mean.

Let’s plot the resulting collection of means to see:

gf_dotplot(~result, data = BootstrapDistribution, method = "histodot", binwidth = 5) %>%
  gf_refine(scale_y_continuous(NULL, breaks = NULL))

Compare this to the “true” sampling distribution we found (but couldn’t actually find in reality):

gf_dotplot(~result, data = SamplingDistribution, method = "histodot", binwidth = 5) %>%
  gf_refine(scale_y_continuous(NULL, breaks = NULL))

Not identical, but pretty similar, right?

What would a dotplot of our estimated sampling distribution look like if we drew samples of 50 houses but we didn’t replace cases between draws within a sample? Predict the results, then try it out, by using sample() instead of resample() in the procedure where we got our bootstrap distribution, and then plotting the results.
Compute the following four quantities: (a) the mean Area in the population (Ames), (b) the mean Area of the sample (Sample), (c) the mean of the means in the true SamplingDistribution, and (d) the mean of the means in the BootstrapDistribution. What do you notice?

Estimating the Standard Error

When we simulated a sampling distribution using the full population as our pool from which the cases making up our samples were drawn, the standard error was the standard deviation of the resulting collection of means (one from each set of 50 houses).

## Get the standard deviation of the collection of means
trueStandardError <- sd(~result, data = SamplingDistribution)
## Display the result
trueStandardError

Once again, though, (it bears repeating) we can’t actually calculate the standard error like this, since the SamplingDistribution was constructed using the full population which we won’t have access to in practice.

Instead, we can calculate an estimated standard error using the bootstrap distribution. It won’t be exactly the same as the “true” standard error, but it should be close.

## Get the standard deviation of the collection of means
estimatedStandardError <- sd(~result, data = BootstrapDistribution)
## Display the result
estimatedStandardError

Using the Estimated Standard Error to Construct a Confidence Interval

Now that we have a reasonable estimate of the standard error, we can construct a confidence interval using the procedure we used before (summarized at the very start of this lab). Making use of the fact that the population mean is usually within 2 standard errors of any given sample mean, we will take our sample mean, along with the standard error we obtained from our bootstrap distribution, and go two standard errors below the sample mean, and two standard errors above the sample mean. We can be 95% confident that the population mean lies between those values.

## Calculate the sample mean and call it sampleMean (you did this already)
sampleMean <- mean(~Area, data = Sample)
## Find the lower endpoint of the confidence interval (we're letting R be our calculator)
## and store it as CI.lower
CI.lower <- sampleMean - 2 * estimatedStandardError
## Find the upper endpoint of the confidence interval (we're letting R be our calculator)
## and store it as CI.upper
CI.upper <- sampleMean + 2 * estimatedStandardError
## Diplay the results inline
CI.lower
CI.upper

Note that the sample mean doesn’t come from our bootstrap distribution; it just comes from our dataset directly (our Sample). Only the standard error comes from the bootstrap distribution.

Did your interval contain the population mean? In most (95%) groups the answer should be “yes” but in a few (about 5%) it may be “no”. Draw your interval on the whiteboard so you can see how intervals vary depending on the particular sample drawn.

Summary

Here’s what we did, all together:

## Simulating the data collection process itself (wouldn't really be done in code)
set.seed(00029747)
Sample <- sample(Ames, size = 50)

## Compute our sample statistic
sampleMean <- mean(~Area, data = Sample)

## Construct a bootstrap distribution to estimate a standard error
BootstrapDistribution <- 
  do(1000) *
    resample(Sample, size = 50) %>%
    mean(~Area, data = .)

## Compute the estimated standard error by taking the standard deviation of
## the collection of means in the bootstrap distribution
estimatedStandardError <- sd(~result, data = BootstrapDistribution)

## Find the lower and upper endpoints of our confidence interval
## (letting R be our calculator)
CI.lower <- sampleMean - 2 * estimatedStandardError
CI.upper <- sampleMean + 2 * estimatedStandardError

## Display the results
CI.lower
sampleMean
CI.upper

For ease of reference, here is the command again to construct a dot plot. Fill in the name of the data frame containing the values you want to plot in place of MyDataFrame, and fill in the name of the column you want to plot in place of myVariableName. You’ll also typically want to adjust the binwidth= to get the dots to occupy a reasonable amount of space.

gf_dotplot(~myVariableName, data = myDataFrame, method = "histodot", binwidth = 5) %>%
  gf_refine(scale_y_continuous(NULL, breaks = NULL))

On Your Own

Construct a bootstrap 95% confidence interval for the mean sale price (Price) in Ames using your sample. Repeat the process using new samples of size 25 and size 100, respectively (be sure to adjust the sample size per sample when constructing the bootstrap distribution too!). How do the intervals differ? Does the result make sense? Explain.
Construct a 95% confidence interval for the correlation between Area and Price the same way (do everything the same way, except where you computed means before, compute correlations instead; the syntax to compute a correlation between two variables in a dataset is reproduced below for easy reference). Do you notice anything different about the bootstrap distribution of correlations, compared to the distribution of means? (Hint: Plot it!)

cor(firstVariableName ~ secondVariableName, data = DataFrameName, use = "pairwise.complete")