We have constructed 95% confidence intervals consisting of the most plausible values of a parameter (such as what we’ve been calling a “pMean”, the mean of some variable in a population or process) based on a data set (a “sample” from a population, or a “snapshot” of a process) using the following facts:

- The sampling distribution of a mean is approximately bell-shaped.
- For most representative datasets, the sMean lands within about 2 standard errors (standard deviations of the sampling distribution) of the pMean.
But, since we do not have access to the population and can’t gather thousands of datasets from it (if we did and could, we wouldn’t need to estimate anything!), how do we know what the standard deviation of the sampling distribution (the standard error) actually is?
Somehow, we need to be able to estimate the standard error using only the single sample we have.
One approach, called Bootstrap Resampling, is as follows: treat our dataset as a “proxy population” standing in for the real one, and simulate the process of collecting many datasets by drawing cases from it at random.
One obvious difference between our dataset and the population, however, is that our sample is usually a lot smaller: maybe we have 100 cases, whereas the population has a million (or if it is a process, maybe it has unlimited potential cases it can generate).
But this is no problem; we can just treat each case in our dataset as a stand-in for a whole lot of cases in the population similar to it.
This means that when gathering cases randomly from the “proxy population” (i.e. the dataset) to form our “simulated datasets” (bootstrap samples), we will never actually “use up” any of the data.
That is, we allow ourselves to potentially draw the same case multiple times, even within a single simulated dataset. We call this sampling with replacement.
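As a tiny illustration of the difference (a made-up four-case vector, not part of the lab data; resample() comes from the mosaic package we use throughout):

library(mosaic)
set.seed(1)                   # any seed; just so the output is reproducible
toy <- c("A", "B", "C", "D")  # a hypothetical four-case "proxy population"
resample(toy, size = 4)       # with replacement: some cases may repeat, others may be skipped
sample(toy, size = 4)         # without replacement: every case appears exactly once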
We’ll use the AmesAll house dataset again as our population. Remember that in a real application we would not have access to this; we are doing this now for the sake of seeing how confidence intervals behave.
AmesAll <- read.file("http://colindawson.net/data/ames.csv")
First let’s simulate collecting a dataset as we did in the last lab. Our dataset will consist of 50 houses randomly selected from all houses sold in Ames between 2006 and 2010.
set.seed(00029747) # Replace this with your choice of number (unique to you)
HomeData <- sample(AmesAll, size = 50)
As a reminder, here’s what we did in the last lab to simulate repeatedly sampling 50 houses from the population, and computing the mean living area from each set of 50 houses.
SamplingDistribution_MeanArea_n50_s5000 <-  # Assigning a name to our results
  do(5000) *                        # Repeat the following 5000 times (5000 is arbitrary)
    sample(AmesAll, size = 50) %>%  # Each time, take 50 random houses from the population
    mean(~Area, data = .)           # For each set of 50 houses (represented by the .)
                                    # get the mean area, and put it in the sampling dist.
But it’s “cheating” to use the full AmesAll dataset in this way, since it represents the entire population, which in reality we would not have access to. You should verify that in the remaining steps we never actually use the full AmesAll dataset; we only ever use our HomeData dataset, which plays the role of the dataset in our actual study.
Here’s what we’ll do instead. Note the parallel structure to what we did before. (Replace the seed before running it)
set.seed(8675309) # To aid in reproducibility
Bootstrap_MeanArea_s5000 <-           # Putting our resulting means in a labeled container
  do(5000) *                          # Repeats what follows many times (5000 is an
                                      # arbitrary "big number")
    resample(HomeData, size = 50) %>% # Each time take 50 random houses with replacement
                                      # from the _dataset_. The value 50 is _not_ arbitrary.
    mean(~Area, data = .)             # For each set of 50 houses, compute the mean area
Note: While the 5000 is an arbitrary “large number”, chosen to obtain a reasonably smooth bell-curve distribution, the 50 is not arbitrary. We choose 50 because that is the sample size of our dataset, and we are interested in the behavior of means computed from datasets of 50 houses, not datasets with some other number of houses.
Exercise 2: Apart from the names we assigned to things (Bootstrap_MeanArea instead of SamplingDistribution_MeanArea), there are only two differences between this procedure and the last one. What are they?

Spoilers for Exercise 2 follow!
By using resample() we are specifying that between each house we draw from our pool (HomeData in this case), we want to “replace” the last house we drew; that is, we want to allow ourselves to draw it again.
This is important, since the pool now only contains 50 houses; so if we didn’t replace houses between draws, drawing 50 houses from the pool would just be drawing every house once. And then every simulated dataset would be the same.
But they’re not: sometimes we get, say, 123 Elm St twice and 300 E College St not at all; another time it might be the opposite. As a result, each simulated dataset of 50 houses is different, and therefore has a different mean area.
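We can verify this directly on a single bootstrap sample of the Area values (the seed is arbitrary, chosen only for illustration):

set.seed(42)                                      # arbitrary seed, for illustration only
boot_areas <- resample(HomeData$Area, size = 50)  # 50 area values drawn with replacement
sum(duplicated(boot_areas))                       # typically > 0, since draws can repeat
mean(boot_areas)                                  # the bMean from this one simulated dataset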
Let’s plot the resulting collection of means to see:
gf_histogram(
  ~result,
  data = Bootstrap_MeanArea_s5000,
  binwidth = 5,  ## usually need to tweak this for each histogram
  xlab = "Bootstrap Mean (bMean)",
  ylab = "Number of Bootstrap Datasets (out of 5000)")
Compare this to the “true” sampling distribution we computed earlier (but couldn’t actually obtain in reality):
gf_histogram(
  ~result,
  data = SamplingDistribution_MeanArea_n50_s5000,
  binwidth = 5,
  xlab = "Sample Mean (sMean)",
  ylab = "Number of Simulated Datasets (out of 5000)")
Not identical, but pretty similar, right?
Exercise: Compute each of the following means:

- the pMean of Area in the full AmesAll data set (what we want to estimate, since we wouldn’t have access to it in reality)
- the sMean of Area in our HomeData dataset (what we’d start with in reality)
- the mean of the sMeans in the true sampling distribution
- the mean of the bMeans in the bootstrap distribution

What do you notice?
Code Hint: Remember that the command to find a mean looks like mean(~VariableName, data = DatasetName), where VariableName is the name of a column in a dataset and DatasetName is the object in your environment that contains the data.
## Replace the 0s with commands that return the quantities of interest
pMean_Area <- mean(~Area, data = AmesAll)
sMean_Area <- 0
mean_of_sMeans <- 0
mean_of_bMeans <- 0
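If you get stuck, one possible completion (spoiler!), using the containers defined earlier, is:

pMean_Area     <- mean(~Area, data = AmesAll)
sMean_Area     <- mean(~Area, data = HomeData)
mean_of_sMeans <- mean(~result, data = SamplingDistribution_MeanArea_n50_s5000)
mean_of_bMeans <- mean(~result, data = Bootstrap_MeanArea_s5000)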
(If you store the corresponding means in objects with the names used below, then the values will display in the Knitted document automatically)
The pMean is 1499.6904437. My dataset mean is 0.
The mean of means from the true sampling distribution is 0.
The mean of means from the bootstrap distribution is 0.
The mean of means from the true sampling distribution is about the same as ______
The mean of means from the bootstrap distribution is about the same as _______
Exercise: To see why replacement matters, try using sample() instead of resample() in the procedure where we got our bootstrap distribution. Save the results in a container called WrongBootstrap, and then plot the resulting means using the histogram command provided (I’ve added a piece for you that sets the range on the x axis to be about the same as for the previous histogram, for easier comparison). A sketch of the setup appears below.
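Here is a minimal sketch of that experiment (the seed and the x-axis limits below are placeholders, not the exact command provided in the lab file):

set.seed(12345)                      # placeholder seed; use your own
WrongBootstrap <-
  do(5000) *
    sample(HomeData, size = 50) %>%  # no replacement: each "simulated dataset" is all of HomeData
    mean(~Area, data = .)

gf_histogram(
  ~result,
  data = WrongBootstrap,
  binwidth = 5) %>%
  gf_lims(x = c(1300, 1750))         # placeholder limits, roughly matching the earlier histograms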
When we simulated a sampling distribution using the full population as our pool from which the houses in our simulated datasets were drawn, the standard error was the standard deviation of the resulting collection of means (one from each set of 50 houses).
## Get the standard deviation of the collection of means (couldn't actually get in reality)
trueStandardError <- sd(~result, data = SamplingDistribution_MeanArea_n50_s5000)
trueStandardError_rounded <- round(trueStandardError, digits = 1)
The true standard error is 71.4.
Once again, though (it bears repeating), we can’t actually calculate the standard error like this, since the SamplingDistribution was constructed using the full population, which we won’t have access to in practice.
Instead, we can calculate an estimated standard error using the bootstrap distribution. It won’t be exactly the same as the “true” standard error, but it should be close.
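For example, using the Bootstrap_MeanArea_s5000 container from above (your value will differ slightly depending on your seed):

## Estimated standard error: the SD of the bootstrap means (this IS available in reality)
est_SE_MeanArea <- sd(~result, data = Bootstrap_MeanArea_s5000)
round(est_SE_MeanArea, digits = 1)  ## compare to the "true" standard error above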
Now that we have a reasonable estimate of the standard error, we can construct a confidence interval.
Making use of the fact that the population mean is within 2 standard errors of the sMean of most representative datasets, we will take our sMean along with the estimated standard error from our bootstrap distribution, and go two standard errors below the sample mean, and two standard errors above the sample mean. We can be 95% confident that the population mean lies between those values.
Note that the center of the interval (the sMean) comes from the original dataset (HomeData) directly. Only the estimated standard error comes from the bootstrap distribution.

The following is a recap of the process and the code we used to construct a bootstrap confidence interval.
First, compute the statistic from the data (here, the mean Area).
sMean_Area <- mean(~Area, data = HomeData)
Next, set the random seed and create the bootstrap distribution.
set.seed(5551212)
Bootstrap_MeanArea_s5000 <-
  do(5000) *
    resample(HomeData, size = 50) %>%
    mean(~Area, data = .)
Then, calculate the estimated standard error.
est_SE_MeanArea <- sd(~result, data = Bootstrap_MeanArea_s5000)
The lower and upper endpoints of the interval are the sample statistic plus and minus two standard errors.
CI_lower_meanArea <- sMean_Area - 2 * est_SE_MeanArea
CI_upper_meanArea <- sMean_Area + 2 * est_SE_MeanArea
## rounding, for reporting inline
sMean_Area_rounded <- round(sMean_Area, digits = 1)
CI_lower_rounded <- round(CI_lower_meanArea, digits = 1)
CI_upper_rounded <- round(CI_upper_meanArea, digits = 1)
Based on our dataset consisting of 50 randomly sampled homes, where the mean area was 1509.1, we are 95% confident that the mean area of houses sold in Ames between 2006 and 2010 is between 1374.8 and 1643.4.
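In the R Markdown source, a sentence like the one above can be generated with inline code; a hypothetical snippet, assuming the rounded objects defined earlier:

Based on our dataset, where the mean area was `r sMean_Area_rounded`, we are 95% confident that the mean area is between `r CI_lower_rounded` and `r CI_upper_rounded`.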
Exercise 12: Construct a 95% bootstrap confidence interval for the correlation between living area (Area) and sale price (Price) in Ames, using your dataset of 50 houses. Compute the interval for the correlation between Area and Price as the sCorrelation plus and minus two standard errors of the correlation. (You may need to refer back to Lab 5 to remind yourself how to compute a correlation in R; essentially everything is the same as for a mean, except that instead of calculating a mean of each bootstrap sample, you will calculate a correlation.) Do you notice anything different about the shape of the bootstrap distribution of correlations, compared to the distribution of means? (Hint: Plot it!) Does this raise any concerns about the formula we are using for the interval? Explain. A sketch of the bootstrap code appears after the rubric below.

SLO D3 (Exercise 12):
+1: The bootstrap code uses resample() to sample with replacement
+1: The bootstrap code samples from the HomeData sample (as opposed to sampling from the AmesAll population)
+1: The correlation is computed for each bootstrap sample
+1: The standard error is computed as the standard deviation of the results in the bootstrap distribution
+1: The confidence interval is centered on the correlation in the HomeData sample
+1: The margin of error is 2 standard errors
+1: The interval is interpreted as being about the correlation between price and area
+1: The interval is interpreted as being about the population of all houses in Ames