This lab picks up right where part (a) left off. You should put your work for both parts in a single RMarkdown document called lab4.Rmd, which you modify from the lab4-template.Rmd file I provided last time.
The server gets bogged down when many people are using it at once. So to ease the load, I’ve modified the ames.csv dataset, removing variables that we aren’t going to consider. Read in the data again, overwriting the old Ames variable, to free up memory on the server.
Last time we simulated taking many samples of 50 houses each from the population of houses in Ames, Iowa, and we calculated how far away the mean living area in the houses in the sample tended to be from the mean living area in all houses in Ames.
In reality we do not have access to the population, nor do we have access to many samples; we only have access to one sample.
So not only do we not know the true population mean, we don’t know how far away sample means tend to be. However, we will see that we can estimate the “typical” discrepancy between a sample statistic and a population parameter (called the standard error) using just our one sample, and then we can use that to give a margin of error to our estimate of the population parameter.
In our case, we want to estimate the mean living area of all houses in Ames, Iowa. This value is our parameter.
We can calculate the mean living area of the 50 houses in our sample. This value is our statistic.
In part (a) of this lab, you found the “typical” discrepancy between the mean of a random sample of 50 houses and the mean of the population of all houses in the city. This number is called the standard error.
Here’s the formal definition of standard error:
Definition: Standard Error
The standard error of a sample statistic is the “typical” discrepancy between that statistic and the corresponding parameter when samples have the same sample size as the dataset used. It is defined as the standard deviation of the sampling distribution of the statistic.
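To make the definition concrete, here is a minimal sketch in base R. It uses a simulated stand-in population (the population vector below is made up and is not the Ames data), and the variable names are illustrative rather than the ones you used in part (a).

```r
## ASSUMPTION: a made-up, right-skewed "population" of 2000 living areas,
## standing in for the real Ames data
set.seed(1)
population <- rgamma(2000, shape = 9, scale = 165)

## Draw 5000 random samples of 50 values each and keep each sample's mean
sampleMeans <- replicate(5000, mean(sample(population, 50)))

## The standard error is the standard deviation of these sample means
standardError <- sd(sampleMeans)
standardError
```

The sampleMeans vector plays the role of the sampling distribution here; its standard deviation is the standard error for samples of size 50 from this particular made-up population.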
You should have noticed two things in particular in part (a):
Since the standard error is defined as the standard deviation of the sampling distribution, this means that when these two facts hold, the sample mean will be within two standard errors of the population mean 95% of the time (that is, for 95% of possible random samples).
SamplingDistribution5000 data frame. Store the lower endpoint of the interval in an R variable called lower and the upper endpoint in a variable called upper.
Let’s verify that this worked, by counting the number of sample means that lie inside the interval:
## This line creates a new dataset, filtering the sampling distribution
## to include only those means between our two endpoints
InsideMeans <-
  filter(
    SamplingDistribution5000,
    result < upper, result > lower)
## This line counts the number of rows in the filtered result
numberWithin <- nrow(InsideMeans)
## This line counts the number of rows in the full sampling distribution
## (though we already know that this is 5000)
totalMeans <- nrow(SamplingDistribution5000)
## Now let's compute the proportion within the interval
## (hopefully it's close to 0.95)
numberWithin / totalMeans
The interval we found above was centered on the population mean, and contains (roughly) 95% of possible sample means from random samples.
In reality we only have one sample, and we want to give an interval that most likely contains the population mean.
Since the population mean is usually (95% of the time) within two standard errors of the sample mean, we can construct this interval the same way, but centering it on the sample mean instead of the population mean.
The resulting interval is called a (95%) confidence interval.
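As a rough illustration of this construction, the sketch below builds one such interval in base R. It again assumes a made-up stand-in population rather than the Ames data, and names such as oneSample, lowerCI and upperCI are hypothetical.

```r
## ASSUMPTION: made-up stand-in population (not the Ames data)
set.seed(2)
population <- rgamma(2000, shape = 9, scale = 165)

## Standard error for samples of 50, estimated by simulation as in part (a)
standardError <- sd(replicate(5000, mean(sample(population, 50))))

## The one sample we actually get to see
oneSample <- sample(population, 50)
sampleMean <- mean(oneSample)

## 95% confidence interval: sample mean plus or minus two standard errors
lowerCI <- sampleMean - 2 * standardError
upperCI <- sampleMean + 2 * standardError
c(lower = lowerCI, upper = upperCI)

## Does the interval contain the population mean?
## (we expect TRUE for roughly 95% of possible samples)
lowerCI < mean(population) & mean(population) < upperCI
```

The only change from the earlier interval is the center: the sample mean replaces the population mean, while the half-width stays at two standard errors.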
Sample50) and the standard error you found before. Does your interval contain the population mean? We expect that most (but perhaps not all) of the class’s intervals will contain the population mean. Plot your interval on the whiteboard so we can see the class’s results. With 16 intervals in the class, and a 95% “success” rate, there may be one or two that don’t contain the right value.
With only 16 intervals, it’s hard to verify precisely the claim that 95% of samples give a confidence interval that contains the population mean. So let’s try to scale up the process.
The following code takes 100 samples of 50 houses each and computes some summary statistics for the Area variable, including mean, standard deviation, quartiles, etc.
Look at the data frame created: there should be one column for each of the sample statistics computed by favstats(), and one row for each of the 100 samples drawn.
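For readers who want to see the shape of such a data frame without the server, here is a base R sketch. It is not the lab's own code: it draws from a made-up stand-in population instead of Ames, keeps only three of the statistics (mean, sd, n), and uses the hypothetical names summarizeSample and sketchStats so that it does not overwrite anything in your own document.

```r
## ASSUMPTION: made-up stand-in population (not the Ames data)
set.seed(3)
population <- rgamma(2000, shape = 9, scale = 165)

## Summarize one random sample of size n
## (only a small subset of what favstats() reports)
summarizeSample <- function(n) {
  x <- sample(population, n)
  data.frame(mean = mean(x), sd = sd(x), n = n)
}

## Repeat for 100 samples of 50 and stack the one-row results
sketchStats <- do.call(rbind, lapply(rep(50, 100), summarizeSample))
head(sketchStats)
```

Each row corresponds to one sample and each column to one statistic, which is the same layout described above.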
Let’s add two new statistics: the lower and upper boundaries of a 95% confidence interval. To reduce computation I’m going to use a formula for the standard error, which is related to the standard deviation of the sample and the sample size.
sampleStats <-
  mutate(sampleStats,
    standardError = sd / sqrt(n),
    CI.lower = mean - 2 * standardError,
    CI.upper = mean + 2 * standardError)
Type the following to execute an R script from my website that defines a useful plotting function, plot_ci(), that will show each of a set of confidence intervals along with the true population mean, highlighting those that miss.
### The source() function executes everything in a given R script file
source("http://colindawson.net/misc/plot_ci.R")
Now we’ll call this function on our collection of sample statistics, specifying the true population mean as the value of the mu= argument.
Note that this function assumes that you have defined the new variables CI.lower and CI.upper by those exact names.
You should see 100 intervals stacked, a dashed vertical line at the population mean, and intervals highlighted in red that “missed” the population mean.
How many of the 100 intervals in your simulation missed (were highlighted in red)? Is this what you expected? (Keep in mind that the number of red intervals won’t be the same for everyone, since everyone is using a different random seed, but they shouldn’t be too far off.)
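If you would rather compute the number of misses than count red lines by eye, something like the following sketch would do it. The helper countMisses() is hypothetical (it is not part of plot_ci.R), and the three toy intervals and the value 1500 are made up purely to show the logic.

```r
## Hypothetical helper: count intervals that do NOT contain mu
countMisses <- function(lower, upper, mu) {
  sum(mu < lower | mu > upper)
}

## Toy example with made-up endpoints, using the same column names as above
toyIntervals <- data.frame(
  CI.lower = c(1400, 1450, 1520),
  CI.upper = c(1600, 1650, 1700))

## Only the third interval (1520 to 1700) misses a "population mean" of 1500
misses <- countMisses(toyIntervals$CI.lower, toyIntervals$CI.upper, mu = 1500)
misses  # 1
```

Applied to your own data frame of 100 intervals, the result divided by 100 gives the observed miss rate, which you can compare with the nominal 5%.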
Create two new sampling distributions as above, one using samples of 25 houses each (half the previous sample size) and one using samples of 100 houses each (twice the previous sample size).
Plot the distribution of sample means for each of the three sample sizes, and compute the mean of means and the standard error.
What do the three distributions have in common? How do they differ? Try to explain why this pattern makes sense.
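One way to organize this comparison is sketched below in base R. It runs on a made-up stand-in population rather than Ames, so the numbers will not match yours; simulateSE() and the column names are hypothetical. The last column applies the sd / sqrt(n) formula to the population standard deviation for reference, which should be roughly (not exactly) in line with the simulated standard errors.

```r
## ASSUMPTION: made-up stand-in population (not the Ames data)
set.seed(4)
population <- rgamma(2000, shape = 9, scale = 165)

## Simulate a sampling distribution of the mean for samples of size n
## and summarize it by its center and spread
simulateSE <- function(n, reps = 2000) {
  sampleMeans <- replicate(reps, mean(sample(population, n)))
  c(n = n, meanOfMeans = mean(sampleMeans), standardError = sd(sampleMeans))
}

## One row per sample size
comparison <- as.data.frame(t(sapply(c(25, 50, 100), simulateSE)))

## Reference value from the formula, using the population standard deviation
comparison$sdOverRootN <- sd(population) / sqrt(comparison$n)
comparison
```

Looking at the meanOfMeans and standardError columns side by side is a quick way to see which feature of the sampling distribution depends on the sample size and which does not.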
Generate a sampling distribution of the mean Price of houses in Ames, using samples of size 50. Use it to estimate the standard error (we’ll need a different way to do this in reality, since we won’t have access to the population; we’ll get to this soon), and construct an interval that should contain the mean price of all houses in Ames with 95% confidence. Check whether it does.