STAT 113 Lab 4a: Sampling Distributions

Partnership Roles

The person in column “A” will be the “reader”. Your job is to have this document open and read it aloud.
The other (in column “B”) is the “scribe”. You are the one who, for this assignment, does the typing. The scribe can ask the reader to stop, and can ask clarification questions, but should type only what the reader tells them to type.
In places where you need to write your own code, discuss it together, but then the reader should make the final call as to what gets typed.

This should help you slow down, read carefully, pay attention to detail, and make sure you understand what you’re reading and writing.

Scribe Only: Log in to the Oberlin RStudioPro Server and Create a Project and Markdown File

Log in at rstudiopro.oberlin.edu.
Click the Project dropdown menu in the top right of the RStudio window and select New Project. Call the project lab4 and make a directory for it inside your stat113 folder.
Open the lab4-template.Rmd file (back in the stat113 folder) and immediately save it to stat113/lab4 as lab4.Rmd. Read the Markdown instructions in the file before moving on.

Attribution

This lab was modified by Colin Dawson from source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department, which accompanies the OpenIntro statistics textbooks. This handout, as well as the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.

Summary of This Lab

The goal of this lab is to explore a simulation-based method of computing standard errors of a statistic like a sample mean, which we use to create confidence intervals about a parameter, like a population mean.

What to Turn In

Combine your work from today’s lab with your work from part (b) on Thursday in a single lab4.Rmd file, which you will Knit into a lab4.html file. Knit every so often; don’t wait until the end, or you are likely to encounter errors that would have been easier to fix earlier on. Turn in both files to the ~/stat113/turnin/lab4/ folder on the RStudioPro server by Friday at 5pm.

Some Important Definitions

Some of the terms in the overview paragraph were only recently defined in class. As a reminder, here are their definitions.

Definition: A population consists of all the cases of potential interest. Populations have parameters (like a mean, median, standard deviation, etc.) that summarize some property of the population.

Definition: A sample consists of a subset of cases from the population; ideally a representative (perhaps random) subset. Summary values like the mean, median, standard deviation, etc. are called statistics when they are calculated on samples. We typically use statistics as estimates of their corresponding parameter.

Definition: To understand how good our estimates are, we want to investigate the sampling distribution of our statistic, which is the distribution of values of that statistic, each calculated from a different random sample of the same size drawn from the population.

The Data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the mosaic package and the data.

library(mosaic)
Ames <- read.file("http://colindawson.net/data/ames.csv")

We will focus on the variable Area, which contains the total above-ground living area in square feet.

Sampling from the Population

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

Suppose we were interested in estimating the mean living area in Ames. We might survey a random sample of, say, 50 homes in the city and collect various information about each home. The full dataset contains 2930 homes, so we will have data on less than 3% of the population, but it will turn out that we can make some decent estimates about the population, provided our sample is random.

Let’s take a sample of size 50 from the population and compute the mean Area in the sample. The sample() command takes a simple random sample of a specified size from a data frame. The result is a data frame consisting of the sampled cases. We will store the result in a named variable called Sample50 (we could pick any name we want).

  An R tip when doing random simulations:
  
  In this lab we will be doing some stochastic (random)
  simulations, like sampling many times from a population.
  This will involve commands that take random samples from larger datasets.
  Since the sample is random, it will change
  every time you re-run the command.  This can be annoying if you are
  trying to describe the results in text.  To avoid this issue, you
  can include the following line of code at the start of your script
  or Markdown document to initialize the random number generator the
  same way every time you re-run or Knit your document (as long
  as the code between this line and the random line stays the same):

## You can pick any number here; I've used my T number
## This needs to go before any line that samples.
set.seed(00029747)

Sample50 <- sample(Ames, size = 50)
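To see what the seed does, here is a minimal sketch in base R (no mosaic needed). The toy data frame ToyHomes and its values are made up for illustration and are not the Ames data, and the seed 113 is arbitrary.

```r
## Hypothetical toy population (NOT the Ames data): 200 made-up living areas
ToyHomes <- data.frame(id = 1:200, Area = seq(800, 2790, by = 10))

## Setting the same seed before each draw gives the same "random" rows
set.seed(113)
FirstDraw <- ToyHomes[sample(nrow(ToyHomes), size = 5), ]

set.seed(113)
SecondDraw <- ToyHomes[sample(nrow(ToyHomes), size = 5), ]

identical(FirstDraw, SecondDraw)   # TRUE: same seed, same sample

## Without resetting the seed, the next draw will almost surely differ
ThirdDraw <- ToyHomes[sample(nrow(ToyHomes), size = 5), ]
```

This is why the set.seed() line needs to come before any line that samples: the seed fixes the sequence of random draws that follows it.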

Describe the distribution of Areas in your sample (you’ll want to plot them first; be sure to load the ggformula package so you can use the gf_ functions we’ve been using to plot).
What would you say is the “typical” size within your sample? What would you interpret “typical” to mean in this context?
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

You may have chosen to use the mean or the median as a “typical” value in the distribution of areas. Either is a reasonable choice. So that we’re all on the same page, let’s focus on the mean.

sampleMean <- mean(~Area, data = Sample50)

If you run the above code using the “Run Chunk” button, you should see the new sampleMean object appear in your Environment tab. Write down its value.
Now find the population mean from the full Ames dataset. How far away is your sample mean from the population mean? Is the sampling error (the difference between the sample mean and population mean) positive or negative?
There are two dotplots on the whiteboard: one for the sample means and one for the sampling errors. Add dots for your results to the board. What is the approximate average sampling error for the class as a whole? What is a “typical” distance between the sample means and the population mean?

Depending on which 50 homes your sample happens to contain, your estimate could be a bit above or a bit below the true population mean. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
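The idea of sampling error can be sketched on a toy population in base R. Everything here is made up for illustration: ToyArea is not the Ames data, and taking the difference as sample mean minus population mean is an assumed sign convention.

```r
## Hypothetical toy population (NOT the Ames data): 1000 made-up living areas
set.seed(113)
ToyArea <- round(runif(1000, min = 500, max = 3500))
toyPopulationMean <- mean(ToyArea)

## Draw one random sample of 50 (without replacement) and compute its mean
ToySample50 <- sample(ToyArea, size = 50)
toySampleMean <- mean(ToySample50)

## Sampling error, taken here as sample mean minus population mean:
## positive if the sample overestimates, negative if it underestimates
toySamplingError <- toySampleMean - toyPopulationMean
toySamplingError
```

Changing the seed changes which 50 values are drawn, so both the size and the sign of the sampling error will vary from run to run.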

Scaling Up the Simulation

In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean directly by repeating the above steps many times.

Computing a sample mean is easy: just get the sample, and compute the mean, as you did above.

With the speed of modern computers, it is easy to simulate sampling many times from a population. For example, we can simulate drawing 10 samples, each of size 50, from the population and look at how the sample mean varies across these 10 samples.

The following code will take 10 samples each of size 50 from the population, and compute the mean Area for each sample, storing the resulting means in a new data frame.

SamplingDistribution10 <- do(10) * 
  sample(Ames, size = 50) %>% 
    mean(~Area, data = .)

R and mosaic note:

The do() function in mosaic allows you to repeat some code
a specified number of times, and store the results of each iteration in
a variable in a data frame.

The "pipe" operator (%>%) passes the data frame produced by the sample()
function on to the mean() function as the data argument, where it
is represented by the dot (.).

Each time we iterate, we are doing the same thing: take a random sample of 50 properties in Ames, and compute the mean Area of the 50 properties in the sample.

Since each sample is random, we wind up with 10 different samples of size 50, and 10 different means, which we store in a data frame called SamplingDistribution10. The data frame has just one column, called result, which contains the means of each sample.
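For intuition, the same repeat-and-store pattern can be sketched in base R with replicate(). This is only a rough analogue, not how do() is implemented. ToyArea is a made-up population vector (not the Ames data), oneSampleMean() is a hypothetical helper, and the column name result is set by hand to mirror the data frame described above.

```r
## Hypothetical toy population (NOT the Ames data): 1000 made-up living areas
set.seed(113)
ToyArea <- round(runif(1000, min = 500, max = 3500))

## One iteration: draw 50 values without replacement, then take their mean
oneSampleMean <- function() {
  mean(sample(ToyArea, size = 50))
}

## Repeat the iteration 10 times and keep the 10 means in a one-column data frame
ToySamplingDistribution10 <- data.frame(result = replicate(10, oneSampleMean()))

nrow(ToySamplingDistribution10)    # 10 rows, one per sample
```

Each call to oneSampleMean() draws a fresh random sample, which is why the 10 stored means differ from one another.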

Examine the SamplingDistribution10 data frame and verify that the column is called result and the values look similar to the sample means we plotted on the board.

Let’s plot the resulting distribution of means with a dot plot:

## The default behavior of gf_dotplot() is a bit weird, so I'm adding some
## tweaks to make the resulting plot less confusing, making the
## bins evenly spaced with width 10 units of mean Area, and removing
## the nonsensical y axis
## Unfortunately this does make the code itself more complicated.
gf_dotplot(~result, data = SamplingDistribution10, method = "histodot", binwidth = 10) %>%
  gf_refine(scale_y_continuous(NULL, breaks = NULL))

Compute the “mean of means” for your sampling distribution of 10 samples. Is it close to the population mean?
Scale up the simulation to produce sampling distributions of means using 20 samples, 100 samples, and 5000 samples. What do you notice about the shape of the sampling distributions as you add more and more sample means to them?
Using your sampling distribution of 5000 sample means, find a “typical” distance between sample mean and population mean. (Hint: Use a measure of variability we’ve discussed. You can refer to “Summarizing a Quantitative Variable” from Lab 2 for code to compute various measures, or look up the function name on the “Most Frequently Used R Commands” cheatsheet on the Labs tab on the course website.)
What proportion of sample means would you expect to be within your “typical” distance of the population mean? (Hint: Use your answer to exercise 8)
Create a sampling distribution of 5000 sample means, but this time include 200 cases in each sample. Plot the results and compute the “typical” distance between sample mean and population mean. With 200 cases in each sample, what proportion of sample means do you expect to be within this “typical” distance?