Review of Key Concepts

Hypothesis Testing Logic

Often we are interested in assessing the strength of evidence for or against a claim.

In the null hypothesis testing framework, we adopt a “skeptical stance”, going into the study with the mindset that there’s “nothing interesting to discover”.

The description of the population/process/phenomenon in which there is “nothing interesting to discover” is called the null hypothesis (abbreviated H0).

The description of the population/process/phenomenon in which there is something interesting to discover is called the alternative hypothesis (abbreviated H1).

To assess how strong the evidence is that there is something there (a relationship between two variables, a phenomenon more interesting than “random guessing”, etc.) we do the following:

  • Decide on a parameter to focus on (such as a mean, a proportion, correlation, difference in means, etc.), which characterizes the population/process/phenomenon of interest

  • State null and alternative hypotheses about that parameter

  • Decide on a corresponding statistic that we will calculate from our data

  • Order the possible values of the statistic from “most supportive of the research claim” (H1) to “least supportive of the research claim”

  • Assign weights to each possible value which sum to 1, based on the skeptic’s model of the world, in which there is “nothing interesting to discover” (that is, in which H0 is accurate)

  • Calculate the actual value of the statistic from the data

  • Find the combined weight of all of the possible values of the statistic which were ranked ahead of or equivalent to the calculated value

  • This combined weight is called the “P-value” and is one measure of the evidence about the hypotheses
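The steps above can be sketched in base R. This is only an illustration under assumed values: a hypothetical study (not the one in this lab) with 10 yes/no trials, a null hypothesis under which each trial succeeds with probability 1/2, and 8 observed successes. All variable names are made up for the sketch, and the weights come from the binomial model that a skeptic would use in that setting.

```r
# Hypothetical setup (assumed for illustration only)
n_trials       <- 10    # number of trials in the imagined study
null_prob      <- 1/2   # success chance if H0 is accurate
observed_count <- 8     # imagined observed number of successes

# Possible values of the statistic, ordered from most to least supportive of H1
possible_counts <- n_trials:0

# Weights under the skeptic's model (binomial probabilities); they sum to 1
weights      <- dbinom(possible_counts, size = n_trials, prob = null_prob)
total_weight <- sum(weights)

# P-value: combined weight of values ranked ahead of or tied with the observed one
p_value <- sum(weights[possible_counts >= observed_count])
p_value   # 56/1024, about 0.055
```

Here the weights are computed exactly; the rest of the lab approximates them by simulation instead.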

Learning Goal of this Lab

  • Carry out a hypothesis test from start to end, identifying the components of the test along the way
  • See how to create a randomization distribution, both in StatKey and in R

The Scenario: “(Male) Love is Blind”

Note: In all of the descriptions that follow, I will use the terms “male” and “female” to indicate a gender self-identification. In the study in question, the participants were male-identifying adults (“males”) who were in romantic relationships with a female-identifying adult (“female”).

Eighteen males in romantic relationships with female partners were recruited for a study whose goal was to assess the ability of male partners to recognize their romantic partner by touching their hand, without visual input.

The males were blindfolded, and asked to touch the backs of the hands of each of three female adults, one of whom was the participant’s romantic partner. The two “decoys” were the same age, height, and weight as the participant’s partner.

Each male was then asked to identify which hand belonged to their partner. Each of the 18 responses was coded as either “correct” or “incorrect”.

The question of interest is: How much evidence does the data provide that members of the target population can identify their partner by touching their hand?

Identifying Components

  1. What are the cases in this study? What is the response variable? Is it quantitative or categorical?


  1. What is the parameter of interest? What population/process/phenomenon does it characterize?


  1. What statistic can be used to estimate the population parameter? What cases does it describe?


Setting up the Hypothesis Test

  1. State the null and alternative hypotheses, as qualitative claims about the population (that is, in words).


  1. State the null and alternative hypotheses as quantitative claims about the population parameter (that is, as equations or inequalities).


  1. List the possible values of the statistic for this study, in order from “most supportive of the alternative hypothesis” to “least supportive of the alternative hypothesis”.


  1. Suppose you had a deck of cards. Propose a method for simulating one dataset of the same sample size as in the actual study, based on the process as the null hypothesis characterizes it.


A “Virtual” Card Game

The person sharing their screen in your group will play the role of the 18 blindfolded participants for this game.

We are going to set up the game so that the blindfolded participants have no way of knowing the correct answer. That is, we are creating a simulation in which the null hypothesis is definitely correct: each participant is just guessing randomly.

We will label the three “hands” as 1, 2 and 3. For each participant, the partner’s hand (the correct answer) will be chosen randomly from these three individuals.

The Guesser’s Job

In the middle column below, write down a list of 18 numbers, each either 1, 2, or 3.

Trial Guess Correct?

The Director’s Job

Someone who is not sharing their screen should do this part. If you have three people in the group, one person can do 1-9 and the other 10-18.

Set a unique seed, and then run the code chunk below to create the list of correct answers.

set.seed(1)
sample(1:3, size = 18, replace = TRUE)  ## assumed call: draws the correct hand (1, 2, or 3) for each of the 18 trials

Once the guesser is done picking their numbers, tell them which ones were correct, and then both of you should fill in the table.
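As a rough sketch of how the table could be filled in with R rather than by hand, the chunk below compares a set of guesses to the randomly chosen answers. It assumes the correct answers are drawn with `sample(1:3, ...)`, and the guesses shown are hypothetical placeholders; a real guesser would substitute their own 18 numbers.

```r
set.seed(1)  # each director should choose their own seed

# Assumed way of drawing the correct hand (1, 2, or 3) for each of the 18 trials
correct_answers <- sample(1:3, size = 18, replace = TRUE)

# Hypothetical guesses, written down before seeing the answers
guesses <- rep(c(1, 2, 3), times = 6)

# One row per trial, mirroring the Trial / Guess / Correct? table
results <- data.frame(
  Trial   = 1:18,
  Guess   = guesses,
  Correct = guesses == correct_answers
)

# Proportion correct in this one simulated study
prop_correct <- mean(results$Correct)
prop_correct
```

The value of `prop_correct` is the statistic for one simulated study in which the guesser had no information about the right answer.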

Round 2

Now switch roles and repeat the experiment. Whoever is the guesser now should fill in their table above, and then their partner(s) should run the code above (after changing the seed), and let the guesser know which responses were correct.

Compiling the Results

  1. In the Google form here, enter your name and the proportion of simulated responses (out of 18, as a number between 0 and 1) that were correct when you were the reader.

Nothing to Write

  1. Modify the line that says “fill = ~(Name == "Colin")” to use your name instead (as you entered it in the Google Form), and then run the code chunk below to see everyone’s responses. As usual for these Google Form plots, I don’t expect you to know how to write code like what is in this chunk.
url <- ""
download.file(url, destfile = "love-is-blind-randomization.csv")
ClassRandomizationDistribution_LoveIsBlind <- read.file("love-is-blind-randomization.csv")
R <- nrow(ClassRandomizationDistribution_LoveIsBlind)
## Replace my name with yours in fill = ~(Name == "Colin"). This
## will color your dot a different color than the others.
## Assumed: the plot is drawn with gf_dotplot() from ggformula, on a column
## named rProportion (the name that appears in the x-axis label).
gf_dotplot(~rProportion,
    data = ClassRandomizationDistribution_LoveIsBlind, 
    fill = ~(Name == "Colin"), 
    binwidth = 1/36,
    method = "histodot",
    ylab = "Number of Simulations") +
  scale_x_continuous(
    name   = "Simulated Proportion Correct by Random Guessing (rProportion)",
    limits = c(0,1),
    breaks = seq(from = 0, to = 1, by = 1/18) %>% round(digits = 2),
    labels = seq(from = 0, to = 1, by = 1/18) %>% round(digits = 2)) +
  scale_y_continuous(
    name   = "# of Sets of 18 Simulated Participants",
    breaks = NULL,
    labels = NULL)

The Actual Result

In the real study, 8 out of 18 participants (about 44%) identified their partner correctly.

  1. What proportion of the students in the class produced a simulation with an outcome at least as favorable to the alternative hypothesis as this? This proportion of simulations (between 0 and 1) is the P-value associated with the result and hypotheses.


Note: Many people will have answered this question before all of the simulations from the class had been completed, so answers will vary.

  1. Do larger P-values or smaller ones provide more convincing evidence that male partners in the target population are better than chance at this task? Make sure you understand what the P-value represents — it is not the proportion of correct answers in the study! Explain your reasoning.
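The class-based P-value described above can be computed along these lines. The proportions below are made-up stand-ins for the class's Google Form entries, not real data. Because entries may have been rounded to two digits, the sketch converts them back to counts out of 18 before comparing them to the 8 correct responses in the real study.

```r
# Hypothetical class results: proportions correct as entered in the form,
# rounded to two digits (made-up values, not real class data)
class_props <- c(0.28, 0.39, 0.33, 0.44, 0.22, 0.33, 0.50, 0.17, 0.33, 0.39)

# Convert back to counts out of 18 so rounding does not affect the comparison
class_counts <- round(class_props * 18)

observed_count <- 8  # correct identifications in the real study

# Proportion of simulations at least as favorable to H1 as the real result
p_value_class <- mean(class_counts >= observed_count)
p_value_class   # 0.2 for these made-up values
```

Comparing counts rather than decimals avoids a subtle problem: an entry of 0.44 is slightly smaller than 8/18, so a direct comparison of proportions would wrongly exclude it.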


Scaling Up the Simulation (StatKey)

  • Go to StatKey and select “Sampling Distributions (Proportion)”

  • Select “Edit Proportion” and enter the hypothetical “long run” proportion of correct responses according to \(H_0\).

  • Select the appropriate sample size by setting \(n\).

  • Simulate several thousand datasets based on fully random guessing.

  • Select either the left-tail, right-tail or two-tail checkbox to highlight the outcomes that would appear to most strongly favor the alternative hypothesis.

  • Change the value below the x-axis to the actual result from the real study (not your simulation) so that the simulations which produced results at least as favorable to the alternative as the actual result are highlighted in red.

  1. What proportion of the simulated results appeared at least as favorable to \(H_1\) as the 8 out of 18 correct in the real study? That is, what is the \(P\)-value for this result based on a large-scale simulation?


HOMEWORK: Scaling Up the Simulation (R)

Constructing a randomization distribution in R is similar to constructing a sampling or bootstrap distribution: we generate a few thousand datasets from some process, and for each dataset, compute the statistic of interest.
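For readers who want to see the idea without the mosaic helpers, here is a minimal base R sketch of the same process. It uses the tea-tasting template values (10 guesses, success chance 1/2, and a result of 9 correct), not the values for the Love is Blind study. The seed and variable names are arbitrary choices for this sketch.

```r
set.seed(123)        # arbitrary seed, chosen for this sketch

n_sims    <- 5000    # number of simulated datasets
n_guesses <- 10      # guesses per simulated dataset (tea-tasting template)
null_prob <- 1/2     # success chance under H0 (tea-tasting template)

# One simulated count of correct guesses per dataset
sim_counts <- rbinom(n_sims, size = n_guesses, prob = null_prob)

# The statistic for each dataset: the proportion correct
sim_props <- sim_counts / n_guesses

# Simulated P-value for a result of 9 correct out of 10,
# compared on the count scale to avoid floating-point surprises
p_value_sim <- mean(sim_counts >= 9)

# Exact binomial tail probability for comparison: 11/1024, about 0.011
p_value_exact <- pbinom(8, size = n_guesses, prob = null_prob, lower.tail = FALSE)
```

With several thousand simulations, `p_value_sim` should land close to `p_value_exact`, which is why a large-scale simulation gives a more stable answer than a single classroom's worth of results.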

  1. In the code chunk below, the values of n= and prob= are set based on the data and hypotheses from the tea-tasting experiment. First replace the seed with a choice of your own. Then, replace the value of the n= argument with the sample size (that is, the number of guesses made in the “Love is Blind” study) associated with each simulated dataset, and replace the value of the prob= argument with the hypothetical long-run success rate in the Love is Blind study according to the skeptic (H0).


RandomizationDistribution_LoveIsBlind <- 
  do(5000) *                     ## We simulate 5000 taste tests
      rflip(n = 10, prob = 1/2)  ## each simulated taste test involves 10 coin flips with success chance 1/2

  1. Graph the prop variable in the randomization distribution you just created. In the code chunk below, the bin width and color-coding cutoff are chosen based on the tea-tasting experiment. Change the value of binwidth= to the difference in proportions corresponding to one additional correct guess (this is the smallest possible distinction for this study and sample size). Then change the value 9/10 in the fill= argument to the actual value of the statistic in the real study, so that the P-value corresponds to the proportion of the area shaded in blue.


## Assumed plotting call: gf_histogram() from ggformula
gf_histogram(~prop,
  data     = RandomizationDistribution_LoveIsBlind,
  fill     = ~(prop >= 9/10),
  binwidth = 1/10,
  xlab     = "Simulated Proportion Correct by Random Guessing (rProportion)",
  ylab     = "# of Sets of 18 Simulated Participants")

  1. To calculate the actual P-value, modify the code below by changing the threshold to which the simulated proportions are compared. The name prop is used in two ways. The inner prop is the variable holding each simulation’s proportion of correct guesses; the comparison checks, for each simulation, whether that proportion was at least as good as the real one. The outer prop() function finds the proportion of simulations that meet this criterion.


P_value_LoveIsBlind <- prop(~(prop >= 9/10), data = RandomizationDistribution_LoveIsBlind)

  1. How surprised should the skeptic be by this result? Find the reciprocal of the P-value, which represents the number of times, on average, the skeptic would expect to have to repeat this study before seeing a result at least as favorable to the alternative hypothesis as this one. The higher this number, the more surprising the result.
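The reciprocal calculation is a single line of arithmetic. The sketch below uses a hypothetical P-value of 0.02, chosen only for illustration and not the answer for this study. The interpretation assumes each repetition of the study independently has that same chance of producing such a result when H0 is accurate.

```r
# Hypothetical P-value (assumed for illustration, not the lab's answer)
p_value <- 0.02

# Expected number of repetitions under H0 before a result this favorable to H1
expected_repetitions <- 1 / p_value
expected_repetitions   # 50
```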
