In a dataset of 120 soccer matches played in the Football Association (FA) premier league in Great Britain, the home team won 70 times. Let’s examine whether this data provides evidence of a structural advantage associated with playing at home.
If we were using randomization to test the null hypothesis that there is no benefit to playing at home against the alternative that in the long run there is either a benefit or a detriment, we would want to simulate datasets of 120 games each. Since the randomization distribution is based on \(H_0\) being true, the outcomes of these simulated games should be modeled by fair coin flips, with each game having a 0.5 probability of a home win.
Randomization_FAGames <-
  do(5000) *
  rflip(n = 120, prob = 0.5)
To find a two-tailed \(P\)-value, we want to know what proportion of the 5000 simulations yielded a sample proportion of home team wins of either 70/120 or better, or 50/120 or worse.
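null_prop <- 0.5
observed_prop <- 70/120
# Written as (120 - 70)/120 rather than 1 - observed_prop: in floating point,
# 1 - 70/120 falls just below 50/120 and would miss simulations with exactly 50 wins.
mirror_image <- (120 - 70)/120
P_value_right <-
  prop(~(prop >= observed_prop), data = Randomization_FAGames)
P_value_left <-
  prop(~(prop <= mirror_image), data = Randomization_FAGames)
P_value <- P_value_right + P_value_left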
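As a cross-check that does not rely on the mosaic helpers, the same two-tailed calculation can be sketched in base R. This is only a sketch: `rbinom()` stands in for `rflip()`, the names `sim_wins`, `p_sim`, and `p_exact` are illustrative rather than part of the original, and the seed is arbitrary. Comparing win counts instead of proportions sidesteps floating-point rounding, and the exact binomial tail is included for comparison.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n_games <- 120
null_prop <- 0.5
observed_wins <- 70
mirror_wins <- n_games - observed_wins  # 50 wins, the mirror image of 70

# 5000 simulated datasets of 120 fair coin flips each (count of home wins)
sim_wins <- rbinom(5000, size = n_games, prob = null_prop)

# Two-tailed P-value: share of simulations at least as extreme as observed
p_sim <- mean(sim_wins >= observed_wins | sim_wins <= mirror_wins)

# Exact two-tailed probability under Binomial(120, 0.5), for comparison
p_exact <- pbinom(mirror_wins, n_games, null_prop) +
  pbinom(observed_wins - 1, n_games, null_prop, lower.tail = FALSE)

c(simulated = p_sim, exact = p_exact)
```

With 5000 simulations the simulated value should land close to the exact one, though it will vary a little from seed to seed.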
This is not needed to find the P-value, but here’s a histogram of the simulated win proportions with those past 70/120 and past 50/120 highlighted.
gf_histogram(
~prop,
data = Randomization_FAGames,
binwidth = 1/120,
fill = ~(prop >= 70/120 | prop <= 50/120)) +
scale_x_continuous(
name = "Proportion of Home Wins",
breaks = seq(0, 1, by = 0.05)) +
scale_fill_discrete(
name = "Counts Toward P-value?",
labels = c("No", "Yes")
)
Let’s repeat the test using a Normal distribution in place of this randomization distribution.
Find the P-value using the pdist() function below. Is the result close to the one we got using randomization?
pdist("norm", q = 0, mean = 0, sd = 1, lower.tail = TRUE)
## [1] 0.5
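One way to set this up in base R is sketched below, with `pnorm()` in place of `pdist()`. It assumes the Normal model is centered at the null proportion 0.5 and that its standard deviation is the usual standard error of a sample proportion, \(\sqrt{p_0(1-p_0)/n}\). That formula is an assumption here, not something stated above; the standard deviation of the simulated proportions could be used instead. The variable names are illustrative.

```r
n_games <- 120
null_prop <- 0.5
observed_prop <- 70 / n_games

# Assumed standard error of a sample proportion when the null is true
se_null <- sqrt(null_prop * (1 - null_prop) / n_games)

# Area to the right of the observed proportion under the Normal model
p_right <- pnorm(observed_prop, mean = null_prop, sd = se_null,
                 lower.tail = FALSE)

# The Normal curve is symmetric, so the two-tailed P-value doubles one tail
p_normal <- 2 * p_right
p_normal
```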
We have previously estimated the mean commute time for commuters in the Atlanta area using bootstrap simulation.
library(Lock5Data)
data(CommuteAtlanta)
This time, let’s estimate the mean commute distance.
Here is a graph of the 500 individual commute distances in the dataset.
gf_histogram(~Distance, data = CommuteAtlanta, binwidth = 5)
To get a bootstrap distribution, we resample datasets of 500 commutes each from this dataset, and calculate the mean commute distance for each dataset of 500 commuters.
Bootstrap_MeanCommuteDistances <-
  do(5000) *
  resample(CommuteAtlanta, size = 500) %>%
  mean(~Distance, data = .)
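To make the resampling step concrete, here is a minimal base R sketch of the same idea. It does not use the CommuteAtlanta data: `fake_distances` is a hypothetical, simulated stand-in, and `sample()` with `replace = TRUE` plays the role of `resample()`. The names and the seed are illustrative assumptions.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for 500 commute distances (NOT the real dataset)
fake_distances <- rexp(500, rate = 1 / 18)

# Each bootstrap sample draws 500 values with replacement; keep its mean
boot_means <- replicate(
  5000,
  mean(sample(fake_distances, size = length(fake_distances), replace = TRUE))
)

# The bootstrap means center near the mean of the original sample
c(sample_mean = mean(fake_distances), bootstrap_center = mean(boot_means))
```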
Here is a histogram of the bootstrap distribution of mean commute distances.
gf_histogram(
~result,
data = Bootstrap_MeanCommuteDistances,
binwidth = 0.1) +
scale_x_continuous(
name = "Bootstrap Mean Commute Distance (mi.)",
breaks = seq(16, 21, by = 0.5)
)
quantile(
~result,
prob = c(0.25, 0.75),
data = Bootstrap_MeanCommuteDistances)
## 25% 75%
## 17.716 18.550
Now let’s try to find the interval again using a Normal approximation in place of the bootstrap distribution.
Find the endpoints of the interval using the qdist() function. Replace the arguments in the code below with the appropriate values. Is your interval approximately the same as the one from the bootstrap distribution?
left_endpoint <- qdist("norm", p = 0.25, mean = 0, sd = 1, lower.tail = TRUE)
right_endpoint <- qdist("norm", p = 0.25, mean = 0, sd = 1, lower.tail = TRUE)
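The general pattern can be sketched in base R with `qnorm()`: the middle 50% of a Normal distribution runs from its 25th to its 75th percentile, matching the quantiles taken from the bootstrap distribution. The helper `normal_interval()` is hypothetical, and the `center` and `se` values in the usage line are placeholders for illustration, not values computed from CommuteAtlanta.

```r
# Middle `level` of a Normal(center, se) distribution, using base R qnorm()
normal_interval <- function(center, se, level = 0.50) {
  tail_area <- (1 - level) / 2
  c(
    left  = qnorm(tail_area, mean = center, sd = se),
    right = qnorm(1 - tail_area, mean = center, sd = se)
  )
}

# Placeholder inputs for illustration only (not taken from the data)
normal_interval(center = 18, se = 0.6, level = 0.50)
```

In practice the center would be the sample mean and `se` the standard error, for example the standard deviation of the bootstrap means.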
We have seen that in a Normal distribution there is a consistent relationship between a value's z-score (the number of standard deviations it lies from the mean) and its percentile.
This means that if we can convert our sample statistic to a z-score within a Normal model of the randomization distribution, then we can get a pretty good idea of the P-value even without the computer.
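In symbols, the conversion looks like the following, where \(SE\) is shorthand (assumed here, not notation from the text above) for the standard deviation of the Normal model of the randomization distribution:

```latex
z = \frac{\text{sample statistic} - \text{null parameter}}{SE}
```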
With a z-score in hand, we could find an exact P-value using a centered and standardized Normal distribution that describes how z-scores are distributed for Normally distributed values.
Find the P-value from the z-score (using pdist()). It should match exactly the P-value you found using the Normal distribution above: this is just a different approach to the same result.
“In a sample of 120 games, the win percentage of the home team in the British Premier League was 70/120, or 58.3%. Based on this data, there (was/wasn’t) significant evidence of a home field advantage in the league (\(z\) = ___, \(P\) = ___, two-tailed).”
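The equivalence of the two approaches can be checked with a short base R sketch. It again assumes the standard error formula \(\sqrt{p_0(1-p_0)/n}\), which is not stated in the text, and uses `pnorm()` rather than `pdist()`; the variable names are illustrative.

```r
n_games <- 120
null_prop <- 0.5
observed_prop <- 70 / n_games
se_null <- sqrt(null_prop * (1 - null_prop) / n_games)  # assumed SE formula

# z-score: how many standard errors the statistic lies from the null value
z <- (observed_prop - null_prop) / se_null

# Two-tailed P-value from the standard Normal (mean 0, sd 1)
p_from_z <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Same area computed on the original proportion scale
p_from_model <- 2 * pnorm(observed_prop, mean = null_prop, sd = se_null,
                          lower.tail = FALSE)

# The two approaches should agree up to floating-point rounding
all.equal(p_from_z, p_from_model)
```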