This lab was modified by Colin Dawson from source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department that accompanies the OpenIntro statistics textbooks. This handout, as well as the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.
In this lab we will use simulation to construct sampling distributions using a large dataset as a “stand-in” for an actual Population or Data-Generating Process.
The learning goals are:
Some of the terms in the overview paragraph were only recently defined in class. As a reminder, here are their definitions.
Definition: A population (or data-generating process) consists of all potential cases of interest. Data-generating processes have parameters (like a mean, median, standard deviation, correlation, etc.) that summarize some property of the process (notice that all the words that start with ‘p’ go together!).
Definition: Datasets (or samples) are snapshots produced by these data-generating processes (possibly via collecting a sample from a population), and can be summarized with statistics (mean, median, standard deviation, correlation, etc.). (Most of these are ‘s’ words!) Each dataset produced by a process is different, even if nothing about the process has changed, as are their statistics.
Definition: Statistics summarize characteristics of datasets/samples, and are used to estimate the value of the corresponding parameter from the underlying process/population. When the data-collection method is unbiased, the potential for overestimation is balanced by the potential for underestimation.
Definition: To understand how good our estimates are, we want to investigate the sampling distribution of our statistic, which consists of a collection of values of that statistic, each one calculated using a different dataset produced by the process. Sampling distributions are “hypothetical” rather than concrete, in that we will never actually have direct access to one in real life; but we can still study their hypothetical properties, and estimate what they would look like.
In reality we almost never have access to the underlying process or the full population. But for this lab, to make things concrete, we will use a large dataset as a stand-in for a population.
The data consists of thousands of real estate transactions from the city of Ames, Iowa, as recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection is a stand-in for our population of interest.
First, let’s load the mosaic package and the data.
library(mosaic) # Redundant, since it's loaded above, but doesn't hurt
<- read.file("http://colindawson.net/data/ames.csv") AmesAll
This is a reduced dataset with just two variables:
Variable | Units | Description |
---|---|---|
Area | Square Feet | Total above-ground living area of the house |
Price | US Dollars | Sale price of the home |
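To take a quick look at the data after loading it, you could run something like the following (a small sketch using standard R commands):

## Peek at the first few rows, and count how many homes are in the
## stand-in population (there should be 2930)
head(AmesAll)
nrow(AmesAll)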
We will focus on the variable Area for now.
Gathering information on an entire population is often extremely costly or impossible. Instead, we often take a sample (a “snapshot”) of the population and use that to understand the properties of the population.
Suppose we were interested in estimating the mean living area of houses in Ames.
We might survey a random sample of, say, 50 homes in the city and collect various information about each home. The full population contains 2930 homes, so we will have data on less than 3% of the population, but it will turn out that we can make some decent estimates about the population, provided our sample is representative.
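As a quick check on that figure, here is the arithmetic as a one-line sketch:

## Sampling fraction: 50 sampled homes out of the 2930 in the stand-in population
## (about 0.017, i.e., under 2% -- comfortably less than 3%)
50 / nrow(AmesAll)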
Let’s create a dataset consisting of 50 houses randomly sampled from the population. Again, the AmesAll data frame represents our population, which we would not normally have access to. The 50 houses we sample represent an actual dataset we would have.
The sample() command takes a subset of cases from a data frame. The value of the size= argument specifies how many cases will be in our dataset. The result is a data frame consisting of the sampled cases. We will store the result in a named variable called HomeData (we could pick any name we want).
In this lab we will be doing some stochastic (random) simulations, like sampling many times from a population.
This will involve commands that take random samples from larger datasets.
Since the sample is random, it will change every time you re-run the command. This can be annoying if you are trying to describe the results in text.
To avoid this issue, you can include the following line of code at the start of a chunk that does random simulation to initialize R’s random number generator. That will cause it to give the same results when you re-run or Knit your document.
## You can pick any number here; I've used my T number
## This should go before any line that involves doing something random.
## Each person in the group should use a different number so you can get different results!
## This is important for the goal of seeing how different possible datasets differ
set.seed(00029747)
HomeData <- sample(AmesAll, size = 50)
1. Describe the distribution of the Area variable in your HomeData dataset (you’ll want to plot the values first – choose an appropriate plot for this type of variable; see the sketch just below these exercises for one possibility). You can refer to Lab 4 for examples of plot commands, but remember that most of them have the form gf_typeofplot(~VariableName, data = DatasetName) for single-variable plots, or gf_typeofplot(YVariable~XVariable, data = DatasetName) for plots involving two variables.

2. What would you say is a “typical” area in your HomeData dataset? What would you interpret “typical” to mean in this context? (Remember that different group members have slightly different data due to the random sampling process.)

You may have chosen to use the mean or the median as a “typical” value in the distribution of areas. Either is a reasonable choice. So that we’re all on the same page, let’s focus on the mean.
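Going back to Exercise 1 for a moment: one possibility is a histogram, since Area is a quantitative variable. The sketch below uses a binwidth of 250, which is just a starting guess you may want to adjust.

## One possible plot for Exercise 1: a histogram of Area in your sample.
## (A sketch; the binwidth is an arbitrary starting point -- adjust to taste.)
gf_histogram(~Area, data = HomeData, binwidth = 250,
             xlab = "Above-ground living area (square feet)",
             ylab = "Number of homes")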
## I'm using `sMean` as a shorthand for 'sample mean', or 'dataset mean',
## or 'snapshot mean'. I've also included
## the name of the variable I'm taking the mean of, so I can remember
## what this value represents.
sMean_Area <- mean(~Area, data = HomeData)
3. Run the code above; you should see the sMean_Area object appear in your Environment tab. Write a sentence giving its value, rounded to one decimal place. (Optional: Use the inline reference syntax we learned in the last lab so that you can just type the variable name in your sentence and have R replace it with its value for you. To round, you can use round(OriginalValue, digits = 1). This command can go inline right in your sentence, or you can create a new variable first with the rounded value.)

4. Compute the mean Area in the full AmesAll dataset. How far away is your statistic from the parameter it is an estimate of? Is the sampling error (the statistic minus the parameter) positive (the statistic is an overestimate) or negative (the statistic is an underestimate)?

Once a few people have entered their results, you can see a graph of everyone’s values by running the following code to download and import everyone’s data from the Google Sheet. You might get a warning about an “incomplete final line”, but it should still make the plot.
(Don’t worry about the details of this code chunk unless you’re really interested; it’s not here because I expect you to learn how to do it, just to produce the plot for you to look at.)
<- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQLgaH2TUTjZmEjaMJ3eJlikW3QtmgzMH08rHaUdTxOCa8ZpIKF0mko-lIWFscWhLRdgtsvUwrMQM6B/pub?gid=1286280604&single=true&output=csv"
url download.file(url, destfile = "class_simulation.csv")
<- read.file("class_simulation.csv")
smallSamplingDist ## Replace my name with yours in fill = ~(Name == "Colin") below. This
## will color your value a different color than the others.
= mean(~Area, data = AmesAll) %>% round(digits = 1)
pMean gf_histogram(
~sMean,
data = smallSamplingDist,
fill = ~(Name == "Colin"),
binwidth = 10,
xlab = "Mean Area of 50 Houses",
ylab = "Number of Data Sets") +
geom_vline(xintercept = pMean) +
annotate("label", x = pMean, y = 2, label = paste("pMean =", pMean))
Depending on which 50 homes your dataset happens to contain, your estimate of the mean living area in Ames will either be a bit above or a bit below the mean living area of all homes in Ames.
In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
In this lab, because we are working with a stand-in for the population, which consists of actual data, we can see how far off a mean from 50 randomly selected houses tends to be from the target parameter by repeating the above steps many times.
Computing a sample mean is easy: just get the sample, and compute the mean, as you did above.
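For instance, drawing one more dataset and computing its mean could look like this (a sketch; the name NewHomeData is just an illustrative choice):

## Draw one fresh dataset of 50 houses and compute its mean Area
## (NewHomeData is just an arbitrary name for this illustration)
NewHomeData <- sample(AmesAll, size = 50)
mean(~Area, data = NewHomeData)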
With the speed of modern computers, it is easy to simulate sampling many times from a population.
For example, we can simulate collecting 10 different datasets, each consisting of 50 houses, and look at how the statistic (the sMean) varies across these 10 datasets:
The following code will collect 10 datasets of 50 houses each, simulating replicating the data-collection process 10 times.
For each of the 10 datasets we will compute the mean Area, and then store the collection of 10 means in a new data frame object.
SamplingDistribution_n50_s10 <- do(10) * # A cute shorthand for Do <X> 10 times
  sample(AmesAll, size = 50) %>%         # This line collects a dataset of 50 and passes it on
  mean(~Area, data = .)                  # This line computes the mean Area in that dataset
## After all is said and done, do(10) * sample(...) %>% mean(...) returns a data frame
## consisting of s = 10 mean areas, each from a different subset of n = 50 houses. The resulting
## data frame is stored in a container called SamplingDistribution_n50_s10
A mosaic note: The do() function in mosaic allows you to repeat some code a specified number of times, and store the results of each iteration in a data frame.
The “pipe” operator (%>%) passes the data frame produced by the sample() function on to the mean() function as its data argument, where it is represented by the . (you should leave this as a literal period).
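To make that concrete, the following sketch shows the same computation written with and without the pipe (each line draws its own random sample, so the two numbers will differ, but the computation has the same form):

## With the pipe: the sampled data frame is passed in where the . appears
sample(AmesAll, size = 50) %>% mean(~Area, data = .)
## Without the pipe: the sample() call is nested directly inside mean()
mean(~Area, data = sample(AmesAll, size = 50))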
Each time we iterate, we are doing the same thing: take a random subset of 50 houses in Ames, and compute the mean Area of the 50 properties in the sample. Since each subset is random, we wind up with 10 different subsets of size 50, which may overlap with each other, and 10 different means, which we store in a data frame called SamplingDistribution_n50_s10. The data frame has just one column, called result, which contains the mean of each sample.
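If you want to inspect it directly, you can print the data frame or look at its first few rows (standard R commands; nothing here is specific to mosaic):

## Print the simulated sampling distribution: ten rows, one column named result
SamplingDistribution_n50_s10
## Or just look at the first few rows
head(SamplingDistribution_n50_s10)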
5. Examine the SamplingDistribution_n50_s10 data frame. Verify that the column is called result, and that the values look similar to the sample means from the Google Form.

Let’s plot the resulting distribution of means with a histogram.
gf_histogram(
~result,
data = SamplingDistribution_n50_s10,
binwidth = 10,
xlab = "Mean Area of 50 Houses",
ylab = "Number of Datasets") +
## The next two lines just add the vertical line at the pMean, and
## then add the text label
geom_vline(xintercept = pMean) +
annotate("label", x = pMean, y = 2, label = paste("pMean =", pMean))
(Remember that the column of sample means in the SamplingDistribution_n50_s10 data frame is called result.)
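If you also want numerical summaries of the ten simulated means, one option is mosaic’s favstats() function (a sketch):

## Numerical summaries (min, quartiles, mean, sd, n) of the ten sample means
favstats(~result, data = SamplingDistribution_n50_s10)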