In 1970, during the Vietnam War, the United States Selective Service conducted a lottery to decide which young men would be drafted into the armed forces (Fienberg, 1971). Each of the 366 birthdays in a year (including February 29) was paired with a draft number (from 1 to 366). The first people drafted were those born on the birthday paired with draft number 1. Everyone born on the day paired with draft number 2 was drafted next, and so on until the target number of draftees was reached.
In this lab, we will investigate whether there is evidence of systematic bias toward certain birthdays in the lottery process that produced the 1970 draft order.
We will regard the 366 dates of the year as the cases. Each date has two variables associated with it: DraftNumber (the position in the draft order of people with that birthday) and SequentialDate (the date's position in the year, with January 1 being sequential date 1, January 31 being sequential date 31, February 1 being sequential date 32, and so on up to December 31, which is sequential date 366).
We can think of SequentialDate as a fixed explanatory variable, and DraftNumber as a random outcome which serves as the response variable.
If the lottery was truly random, what should be true about the association between the DraftNumber and SequentialDate variables? Let's read in the data and look at a scatterplot with SequentialDate on the \(x\)-axis and DraftNumber on the \(y\)-axis.
## Delete fig.keep = 'none' above when you get here
DraftLottery <- read.file("http://colindawson.net/data/draftlottery.txt")
gf_point(DraftNumber ~ SequentialDate, data = DraftLottery)
Spoiler: It’s difficult to see much of a pattern or association in the scatterplot, so it seems reasonable to conclude that this was a fair, random lottery with a correlation coefficient near zero. But let’s dig a little deeper…
Let’s split the data by month and look at the distribution of draft numbers within each month of birthdays.
## This first line tells R that the month names go in a specific order
DraftLottery <- DraftLottery %>%
mutate(Month = factor(Month, levels = month.name))
gf_boxplot(DraftNumber ~ Month, data = DraftLottery) %>%
gf_refine(coord_flip()) # Makes the boxes horizontal
Looking at the boxplots, do you see any pattern across the months?
The correlation coefficient between SequentialDate and DraftNumber is \(r = -0.226\). What does this mean in context? Is it consistent with the pattern you noticed in the last question?
One possible explanation for the negative correlation for this particular drawing might be some form of systematic bias in the lottery process. Is some form of systematic bias the only explanation for a negative sample correlation? If not, what else could it be attributed to?
If we want to determine whether a correlation like this one would be likely to occur by random chance in a fair lottery, we can do a hypothesis test.
What are the null and alternative hypotheses, in words and as statements about a parameter?
How could you simulate a dataset from a hypothetical world in which you know \(H_0\) to be true using pieces of paper (if you had enough of them)? What sample statistic would you record for each simulated sample to construct a randomization distribution?
How would you go about evaluating the strength of the evidence against the null hypothesis and in favor of the alternative hypothesis, once you had your randomization distribution?
Simulating all 366 draft numbers by hand would take forever, but let's simulate a miniature version of the draft with only 10 birthdays, which will be paired with 10 draft numbers.
You have ten playing cards numbered 1-10, where Ace is 1. These cards will represent our mini set of birthdays.
Simulate a dataset by shuffling the cards and placing them face up on the table in a random order from left to right. The leftmost “birthday” is draft number 1, the second from the left is draft number 2, etc.
Enter the data into R as follows, replacing the ...s in the SequentialDate variable with the cards you drew, in order:
## Delete eval = FALSE after you've filled in your results
MiniDraftData <-
  data.frame(
    SequentialDate = c(..., ..., ..., ..., ..., ..., ..., ..., ..., ...),
    DraftNumber = 1:10)
Let’s plot your data and compute a correlation in the usual way:
## Delete eval = FALSE when you get here
gf_point(DraftNumber ~ SequentialDate, data = MiniDraftData)
cor(DraftNumber ~ SequentialDate, data = MiniDraftData)
Now repeat your simulation and once again compute a correlation. Write down your two correlation values, then put them on the dot plot on the board.
Based on a draft with just 10 birthdays, how surprising would a correlation of \(-0.226\) be? (That is, what proportion of the class's simulated datasets yielded a correlation at least as extreme, in absolute value, as that?)
Instead of shuffling and dealing cards, we can quickly get several thousand simulated datasets from a world in which there is no association between SequentialDate and DraftNumber. Here's some R code to do it 5000 times.
## Delete eval = FALSE when you get here
RandomizationDistribution_mini <-
do(5000) *
cor(DraftNumber ~ shuffle(SequentialDate), data = MiniDraftData)
Let’s unpack this code a little bit. As with other simulations, we are first telling R to do() something 5000 times (with “times” represented by the * symbol; cute, huh?).
What are we doing in each iteration? Unlike with a sampling distribution or a bootstrap distribution, we’re not taking a random sample. Instead, we’re taking a dataset and randomly reordering (“shuffling”) one of the variables, in a way that has no relationship to the other variable. In this way we’re ensuring that the resulting dataset, while it might have a numerical correlation which isn’t zero, has no structural association between the two variables, and so any correlation that does arise is due purely to random chance.
In other words, the dataset we get most definitely arises from a world in which there is no “real” correlation between the variables (i.e., where the correlation parameter is exactly zero). But of course the correlation statistic (which we are computing in each iteration) will not be exactly zero.
Let’s plot the resulting randomization distribution of correlations:
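A plain dot plot (without any highlighting yet) can be made with the same dotPlot() function used for the highlighted version; this sketch assumes the RandomizationDistribution_mini object created in the previous chunk:

```r
## Delete eval = FALSE when you get here
dotPlot(~cor, data = RandomizationDistribution_mini)
```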
We can even highlight the randomization correlations that lie beyond \(-0.226\).
## Delete eval = FALSE when you get here
dotPlot(
~cor,
data = RandomizationDistribution_mini,
groups = abs(cor) >= 0.226)
and calculate what proportion of the simulated correlations fall in this category:
## Delete eval = FALSE when you get here
tally(~(abs(cor) >= 0.226), data = RandomizationDistribution_mini, format = "proportion")
This is our \(P\)-value!
Of course, a draft with only 10 dates might produce different correlations than a draft with 366 dates. Let’s repeat this process with the actual data, which is in the DraftLottery data frame.
Construct a randomization distribution using the DraftLottery data, plot it (highlighting those randomization correlations which are at least as favorable to the alternative hypothesis as the \(-0.226\) that occurred in reality), compute the \(P\)-value, and interpret the results.
Once they saw these results, statisticians were quick to point out that something fishy happened with the 1970 draft lottery. The irregularity can be attributed to improper mixing of the balls used in the lottery drawing process: balls with birthdays early in the year were placed in the bin first, and balls with birthdays late in the year were placed in the bin last. Without thorough mixing, balls with birthdays late in the year settled near the top of the bin and so tended to be selected earlier. The mixing process was changed for the 1971 draft lottery (e.g., two bins, one for the draft numbers and one for the birthdays), for which the correlation coefficient turned out to be \(r = 0.014\).
On an abstract level, many other kinds of hypothesis tests can be described as tests of an association. If we have a binary explanatory variable, for example, which divides our cases into two groups, asking whether the groups are different (in terms of either a proportion, which is based on a binary response variable, or a mean, which is based on a quantitative response variable) amounts to asking whether the explanatory variable and the response variable are associated.
Therefore, if we want to construct a randomization distribution for such a hypothesis test, we can do essentially the same thing we did with the draft data: shuffle one of the variables to create completely random pairings between explanatory and response; that is, to create completely random groupings of the outcomes. If the data came from an experiment, we can think of this shuffling as simulating random assignment of cases into the groups; but even for an observational study it is still sensible to do this.
Let’s repeat the simulation of the penguins and metal bands data, this time in R.
Look at how the data frame is structured by peeking at the data.
We can construct a randomization distribution the same way that we did for the DraftLottery data, but this time we will use a different statistic: rather than a correlation (via the cor() function), we’ll compute a difference in proportions (via the diffprop() function). But everything else (other than the dataset and variable names) is the same.
Note: the diffprop() function will compute the difference between two groups by putting the group names in alphabetical order and subtracting the second proportion from the first.
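As a sketch of what this looks like, with placeholder names (the text does not name the penguins and metal bands data frame or its variables, so PenguinsMetal, Response, and Group below are hypothetical; substitute the actual names from your data):

```r
## Hypothetical names: replace PenguinsMetal, Response, and Group
## with the actual data frame and variable names
## Observed difference in proportions between the two groups
diffprop(Response ~ Group, data = PenguinsMetal)
## Randomization distribution: shuffle the group labels 5000 times
RandomizationDistribution_penguins <-
  do(5000) *
    diffprop(Response ~ shuffle(Group), data = PenguinsMetal)
```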
We can use the exact same procedure to test for a difference in means.
The SleepCaffeine dataset in the Lock5Data package comes from an experiment examining memorization ability for two groups of randomly assigned college students: one group took a 90-minute nap; the other had a pill containing an amount of caffeine comparable to a cup of coffee. The response was the number of words recalled from a list the students were asked to memorize.
The diffmean() function computes a difference in means between two groups. It has the same syntax as cor() and diffprop().
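Putting this together, here is a sketch of the randomization test for this experiment (assuming the response and grouping variables in SleepCaffeine are named Words and Group; check the Lock5Data documentation for the exact names):

```r
## Delete eval = FALSE when you get here
library(Lock5Data)
data(SleepCaffeine)
## Observed difference in mean words recalled between groups
## (assumes the variables are called Words and Group)
diffmean(Words ~ Group, data = SleepCaffeine)
## Randomization distribution under "no group difference"
RandomizationDistribution_sleep <-
  do(5000) *
    diffmean(Words ~ shuffle(Group), data = SleepCaffeine)
```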