Learning Goals

  • Practice finding margins of error and P-values using Normal Distributions
  • Convert parameters and statistics to and from z-scores

Example 1: Home Field Advantage in the British FA Premier League

In a dataset of 120 soccer matches played in the Football Association (FA) Premier League in Great Britain, the home team won 70 times. Let’s examine whether this data provides evidence of a structural advantage associated with playing at home.

  1. What are the parameter and statistic of interest here? What are the null and alternative hypotheses?

RESPONSE

If we were using randomization to test the null hypothesis that there is no benefit to playing at home against the alternative that in the long run there is either a benefit or a detriment, we would simulate datasets of 120 games each. Since the randomization distribution is built assuming \(H_0\) is true, the outcome of each simulated game should be modeled by a fair coin flip.

Randomization_FAGames <-
  do(5000) *
    rflip(n = 120, prob = 0.5)

To find a two-tailed \(P\)-value, we want to know what proportion of the 5000 simulations yielded a sample proportion of home team wins of either 70/120 or better, or 50/120 or worse.

null_prop <- 0.5
observed_prop <- 70/120
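# Mirror image of the observed proportion across null_prop. It is written as
# a count over 120 because 1 - 70/120 is not exactly equal to 50/120 in
# floating point, which would leave out simulations with exactly 50 home wins.
mirror_image <- (120 - 70)/120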
P_value_right <- 
  prop(~(prop >= observed_prop), data = Randomization_FAGames)
P_value_left <-
  prop(~(prop <= mirror_image), data = Randomization_FAGames)
P_value <- P_value_right + P_value_left
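
The same randomization test can be sketched in base R for readers who want to see it without the mosaic helpers. This is an illustration under stated assumptions, not the worksheet's own method: rbinom() stands in for rflip(), the seed is arbitrary, and the names sim_wins, right_tail, left_tail, and p_value_sim are made up for this sketch. The tails are compared as win counts rather than proportions so that floating-point rounding cannot affect which simulations are counted.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_games <- 120
n_sims  <- 5000

# Each simulated dataset is 120 fair coin flips; keep the number of home wins.
sim_wins <- rbinom(n_sims, size = n_games, prob = 0.5)

# Two-tailed P-value: share of simulations at least as extreme as 70 wins
# in either direction (70 or more, or 50 or fewer).
right_tail  <- mean(sim_wins >= 70)
left_tail   <- mean(sim_wins <= n_games - 70)
p_value_sim <- right_tail + left_tail
p_value_sim
```

Because the simulation is random, the result shifts slightly from one seed to the next.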

  1. What is the conclusion of the test in context, using a significance level of 5%?

SOLUTION

This is not needed to find the P-value, but here’s a histogram of the simulated win proportions with those past 70/120 and past 50/120 highlighted.

gf_histogram(
  ~prop, 
  data = Randomization_FAGames, 
  binwidth = 1/120, 
  fill = ~(prop >= 70/120 | prop <= 50/120)) +
  scale_x_continuous(
    name = "Proportion of Home Wins",
    breaks = seq(0, 1, by = 0.05)) +
  scale_fill_discrete(
    name = "Counts Toward P-value?",
    labels = c("No", "Yes")
  )

Let’s repeat the test using a Normal distribution in place of this randomization distribution.

  1. What should the mean and standard deviation of this Normal approximation be? (Don’t guess from the histogram — calculate the actual values)

SOLUTION

  1. Find the two-tailed P-value using this Normal approximation by plugging in the correct values to the pdist() function below. Is the result close to the one we got using randomization?

pdist("norm", q = 0, mean = 0, sd = 1, lower.tail = TRUE)

## [1] 0.5
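
As a pattern for this kind of calculation, here is a minimal sketch on a made-up example (112 heads in 200 flips, not the soccer data). It assumes the usual Normal model for a sample proportion under the null hypothesis, centered at the null proportion with standard deviation sqrt(p0 * (1 - p0) / n), and it uses base R's pnorm(), which takes the same q, mean, sd, and lower.tail arguments shown above. All variable names are hypothetical.

```r
# Hypothetical example, not the soccer data: 112 heads in 200 flips.
n     <- 200
p0    <- 0.5        # null proportion
p_hat <- 112 / n    # observed sample proportion

# Standard deviation of the sample proportion when the null is true.
se <- sqrt(p0 * (1 - p0) / n)

# Area at or above the observed proportion, then doubled because the
# Normal model is symmetric about p0.
upper_tail <- pnorm(p_hat, mean = p0, sd = se, lower.tail = FALSE)
p_two_tailed <- 2 * upper_tail
p_two_tailed
```

Doubling one tail is only appropriate because the Normal curve is symmetric; with a randomization distribution the two tails are counted separately.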

Example 2: Estimating Atlanta Commute Times

We have previously estimated the mean commute time for commuters in the Atlanta area using bootstrap simulation.

library(Lock5Data)
data(CommuteAtlanta)

This time, let’s estimate the mean commute distance.

Here is a graph of the 500 individual commute distances in the dataset.

gf_histogram(~Distance, data = CommuteAtlanta, binwidth = 5)

To get a bootstrap distribution, we resample datasets of 500 commutes each from this dataset, and calculate the mean commute distance for each dataset of 500 commuters.

Bootstrap_MeanCommuteDistances <-
  do(5000) *
    resample(CommuteAtlanta, size = 500) %>%
    mean(~Distance, data = .)

Here is a histogram of the bootstrap distribution of mean commute distances.

gf_histogram(
  ~result, 
  data = Bootstrap_MeanCommuteDistances, 
  binwidth = 0.1) +
  scale_x_continuous(
    name = "Bootstrap Mean Commute Distance (mi.)",
    breaks = seq(16, 21, by = 0.5)
  )
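
The resampling idea can also be sketched in base R without the mosaic helpers. The sketch below is an illustration only: it uses a made-up right-skewed sample (fake_distances) instead of CommuteAtlanta, an arbitrary seed, hypothetical variable names, and a 95% level chosen purely for the example.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical right-skewed sample standing in for real data.
fake_distances <- rexp(500, rate = 1 / 18)

# Resample 500 values with replacement and record the mean, 5000 times.
boot_means <- replicate(
  5000,
  mean(sample(fake_distances, size = length(fake_distances), replace = TRUE))
)

# Percentile interval: the middle 95% of the bootstrap means.
boot_interval <- quantile(boot_means, probs = c(0.025, 0.975))
boot_interval
```

For a middle share of C, the two percentiles sit at (1 - C) / 2 and 1 - (1 - C) / 2.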

  1. To find a 90% confidence interval, we can find the ___ and ___ percentiles of this distribution. Fill in the appropriate proportions by editing the code below.

quantile(
  ~result, 
  probs = c(0.25, 0.75), 
  data = Bootstrap_MeanCommuteDistances)

##    25%    75% 
## 17.716 18.550

  1. Interpret the interval in context.

RESPONSE

Now let’s try to find the interval again using a Normal approximation in place of the bootstrap distribution.

  1. What should the mean and standard deviation be for this Normal?

RESPONSE

  1. Find the corresponding quantiles of the Normal model using the qdist() function. Replace the arguments in the code below with the appropriate values. Is your interval approximately the same as the one from the bootstrap distribution?

left_endpoint <- qdist("norm", p = 0.25, mean = 0, sd = 1, lower.tail = TRUE)
right_endpoint <- qdist("norm", p = 0.25, mean = 0, sd = 1, lower.tail = TRUE)
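
The mechanics of turning a Normal model into an interval can be sketched with base R's qnorm(), which takes the same p, mean, sd, and lower.tail arguments. The numbers below are hypothetical and are not taken from the commute data; the names center, se, level, and normal_interval are likewise made up for the sketch.

```r
# Hypothetical values, not taken from the commute data.
center <- 50    # sample statistic at the center of the Normal model
se     <- 2     # standard error of that statistic
level  <- 0.95  # confidence level, chosen only for illustration

# Split the leftover area evenly between the two tails.
tail_area <- (1 - level) / 2

# Lower and upper endpoints are the matching quantiles of the Normal model.
normal_interval <- qnorm(c(tail_area, 1 - tail_area), mean = center, sd = se)
normal_interval
```

Changing level changes only tail_area, so the same two lines cover any confidence level.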

HOMEWORK: Using z-scores and a Standardized Distribution Model

We have seen that when we have a Normal distribution, there is a consistent relationship between the number of standard deviations away from the mean that a given value is — that is, its z-score — and its percentile.

This means that if we can convert our sample statistic to a z-score within a Normal model of the randomization distribution, then we can get a pretty good idea of the P-value even without the computer.
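
A minimal sketch of that relationship, using a made-up Normal model and observed value rather than the soccer data: the z-score is the distance from the mean in units of standard deviations, and the tail area is the same whether it is computed on the original scale or from the z-score. The variable names are hypothetical, and pnorm() is left at its default (standard Normal) settings in the second call.

```r
# Hypothetical Normal model and observed value (not the soccer data).
model_mean <- 0.5
model_sd   <- 0.04
observed   <- 0.56

# z-score: how many standard deviations the value lies above the mean.
z <- (observed - model_mean) / model_sd

# Upper-tail area on the original scale and on the z-score scale.
area_original     <- pnorm(observed, mean = model_mean, sd = model_sd,
                           lower.tail = FALSE)
area_standardized <- pnorm(z, lower.tail = FALSE)
c(z = z, original = area_original, standardized = area_standardized)
```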

  1. Convert the proportion of home wins in the FA Premier League data to a z-score within the Normal distribution we used in place of the randomization distribution. This value is called a test statistic.

SOLUTION

With a z-score in hand, we could find an exact P-value using a centered and standardized Normal distribution that describes how z-scores are distributed for Normally distributed values.

  1. What should the mean of a set of z-scores be? (Hint: what is the z-score of the mean of a distribution?) What should the standard deviation of a set of z-scores be? (Hint: what is the z-score of a value which is one standard deviation above the mean?)

SOLUTION

  1. Find the P-value associated with the test statistic you found above by using this standard Normal distribution (use pdist()). It should match exactly the P-value you found using the Normal distribution above: this is just a different approach to the same result.

SOLUTION

  1. Results of hypothesis tests in scientific articles are typically reported as follows. Fill in the values of the test statistic and P-value.

“In a sample of 120 games, the win percentage of the home team in the British Premier League was 70/120, or 58.3%. Based on this data, there (was/wasn’t) significant evidence of a home field advantage in the league (\(z\) = ___, \(P\) = ___, two-tailed).”