Some students were interested in the effect of acidic environments on plant growth. They planted alfalfa seeds in fifteen cups, randomly assigning five to be watered with plain water, five to receive a solution with a moderate concentration of hydrochloric acid (1.5 moles per liter), and five to receive a stronger acid solution (3 moles per liter).
The cups were arranged indoors near a window in five rows of three, with one cup from each acid level in each row (row a nearest the window, and row e farthest away). The response variable was the average Height of the alfalfa sprouts (I assume in centimeters, though the dataset doesn’t specify) after four days.
The predictor of interest is Acid, which has three categorical levels: water, 1.5HCl, and 3.0HCl.
We could write the prediction equation for a one-way ANOVA model as follows:
\[ Height_i = \mu_{Overall} + \alpha_{Acid_i} + \epsilon_i \]
The individual heights, organized by acidity condition and row, are as follows:
Row | 1.5HCl | 3.0HCl | water |
---|---|---|---|
a | 1.00 | 1.03 | 1.45 |
b | 0.70 | 1.22 | 2.79 |
c | 1.37 | 0.45 | 1.93 |
d | 2.80 | 1.65 | 2.33 |
e | 1.46 | 1.07 | 4.85 |
Position relative to the window (Row) is likely to influence height, so if we do not include it as a predictor, it will introduce non-independence among the residuals for sprouts in the same row. Even though Row is not the target of study, it should therefore be included in the model so that the inferences we make about Acid are valid.
A two-predictor model which is effect coded (that is, where the “intercept” represents the overall mean rather than a reference mean) would look like this:
\[ Height_i = \mu_{Overall} + \alpha_{Acid_i} + \beta_{Row_i} + \epsilon_i \]
with individual parameters \(\mu_{Overall}\), \(\alpha_{water}\), \(\alpha_{1.5HCl}\), \(\alpha_{3.0HCl}\), \(\beta_a\), \(\beta_b\), \(\beta_c\), \(\beta_d\) and \(\beta_e\).
So, for example, if plant \(6\) received plain water and is in row \(b\), its prediction equation would be:
\[ Height_6 = \mu_{Overall} + \alpha_{Water} + \beta_{b} + \epsilon_6 \]
because \(Acid_6 = Water\) and \(Row_6 = b\).
As before, the \(\alpha\)s represent deviations between the mean of each Acid group and the overall mean.
Similarly, the \(\beta\)s represent deviations between the mean of each Row group and the overall mean.
We can get the Row and Acid means, as well as the overall mean, as follows:
library(Stat2Data)
library(mosaic)  # provides the formula interface to mean() used below
data(Alfalfa)
overall_mean <- mean(~Ht4, data = Alfalfa)      # grand mean of all fifteen heights
acid_means <- mean(Ht4 ~ Acid, data = Alfalfa)  # mean height within each Acid level
row_means <- mean(Ht4 ~ Row, data = Alfalfa)    # mean height within each Row
With a one-predictor ANOVA model, we could write the deviation of each observation from the overall mean as
\[ Y_i - \mu_{Overall} = \alpha_{X_i} + \epsilon_i \]
We derived a decomposition of the total sum of squared deviations (total “Sum of Squares”) as follows:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{i=1}^N \alpha_{X_i}^2 + \sum_{i=1}^N \epsilon_i^2 \]
We rewrote the sum over squared alphas as a double summation, letting \(N_k\) represent the number of observations in group \(k\) and indexing each observation with its group (\(k\)) and its case number within its group (\(j\)). So we had
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_{k}^2 + \sum_{i=1}^N \epsilon_i^2 \]
which, after using the fact that the \(\alpha\)s didn’t depend on \(j\), simplified to:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{k=1}^K N_k \alpha_{k}^2 + \sum_{i=1}^N \epsilon_i^2 \]
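As a quick numerical check with the alfalfa data, where every Acid group has \(N_k = 5\) (the name acid_alphas below is just for illustration):

acid_alphas <- acid_means - overall_mean
sum(5 * acid_alphas^2)   # the one-way Sum of Squares for Acid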
We can do something similar with the two-predictor model:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{i=1}^N \alpha_{Acid_i}^2 + \sum_{i=1}^N \beta_{Row_i}^2 + \sum_{i=1}^N \epsilon_i^2 \]
defining the left-hand side as \(SS_{Total}\) as before, and breaking apart \(SS_{Model}\) into two pieces:
\[ SS_{Acid} = \sum_{i=1}^N \alpha_{Acid_i}^2 \] and
\[ SS_{Row} = \sum_{i=1}^N \beta_{Row_i}^2 \]
Then, as before, we have
\[ SS_{Residuals} = \sum_{i=1}^N \epsilon_i^2 \]
We can find \(SS_{Total}\) as follows:
overall_deviations <- with(Alfalfa, Ht4 - overall_mean)  # each plant's deviation from the grand mean
SSTotal <- sum(overall_deviations^2)
SSTotal
## [1] 17.1066
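Following their definitions above, \(SS_{Acid}\) and \(SS_{Row}\) attach one squared alpha or beta to each of the \(N = 15\) observations. Here is a sketch of that computation, indexing each group's mean by each plant's group label:

SSAcid <- sum((acid_means[as.character(Alfalfa$Acid)] - overall_mean)^2)  # one alpha^2 per plant
SSRow <- sum((row_means[as.character(Alfalfa$Row)] - overall_mean)^2)     # one beta^2 per plant
SSAcid; SSRow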
To check that our decomposition is valid, let’s find the sum of squared residuals directly.
Here are the actual data values (since there is only one observation in each combination of Acid and Row, the mean at that combination is just that data value):
actual_values <- mean(Ht4 ~ Row + Acid, data = Alfalfa) %>%
matrix(
nrow = 5,
ncol = 3,
dimnames = list(
Row = c("a","b","c","d","e"),
Acid = c("1.5HCl", "3.0HCl", "water")))
actual_values
## Acid
## Row 1.5HCl 3.0HCl water
## a 1.00 1.03 1.45
## b 0.70 1.22 2.79
## c 1.37 0.45 1.93
## d 2.80 1.65 2.33
## e 1.46 1.07 4.85
alphas <- acid_means - overall_mean  # Acid effects: group means minus overall mean
betas <- row_means - overall_mean    # Row effects: group means minus overall mean
## outer() applies a function to each combination of values in its first two arguments
fitted_values <- overall_mean + outer(betas, alphas, "+")
fitted_values
## 1.5HCl 3.0HCl water
## a 0.886 0.504 2.09
## b 1.296 0.914 2.50
## c 0.976 0.594 2.18
## d 1.986 1.604 3.19
## e 2.186 1.804 3.39
And here are the residuals, obtained by subtracting the fitted values from the actual values:
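resids <- actual_values - fitted_values  # observed minus fitted, cell by cell
resids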
## Acid
## Row 1.5HCl 3.0HCl water
## a 0.114 0.526 -0.64
## b -0.596 0.306 0.29
## c 0.394 -0.144 -0.25
## d 0.814 0.046 -0.86
## e -0.726 -0.734 1.46
So then \(SS_{Residuals}\) should be the sum of these squared residuals:
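SSResiduals <- sum(resids^2)
SSResiduals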
Does this match the value you found using the variance decomposition?
With a one-way model, we noted that the alphas always sum to zero once they are weighted by their sample sizes: the overall mean is a weighted average of the group means, so the weighted average deviation of the group means from the overall mean is zero.
The same thing holds true here, for both the alphas and the betas.
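Since the design here is balanced (five plants per Acid level and three per Row), even the unweighted sums are zero, which is easy to check:

sum(alphas)   # zero, up to rounding error
sum(betas)    # zero, up to rounding error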
We can organize these pieces into a two-way ANOVA table:

Source | df | SS | MS | F |
---|---|---|---|---|
Acid |  |  |  |  |
Row |  |  |  |  |
Residuals |  |  |  | N/A |
Total |  |  | N/A | N/A |
We can get a P-value for an F-statistic as follows, provided the residuals satisfy the standard independence, Normality and constant variance conditions (plugging in the F value and the degrees of freedom):
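For example, using the Acid line of the table below (an F of 4.513 on 2 and 8 degrees of freedom):

pf(4.513, df1 = 2, df2 = 8, lower.tail = FALSE)   # upper-tail area of the F distribution; about 0.049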
You can check your ANOVA table by fitting the additive model with aov() and printing its summary:
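summary(aov(Ht4 ~ Acid + Row, data = Alfalfa))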
## Df Sum Sq Mean Sq F value Pr(>F)
## Acid 2 6.852 3.426 4.513 0.0487 *
## Row 4 4.183 1.046 1.378 0.3235
## Residuals 8 6.072 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An experiment recorded the amount of force, in Newtons (the response), needed to separate two pieces of plastic that had been glued together.
The plastic had one of three thicknesses (thin, moderate, thick), and the pieces were held together with one of two types of glue (wood vs. plastic).
There are two cases at each combination of factors.
GlueData <- read.file("http://colinreimerdawson.com/data/glue.csv") %>%
  mutate(
    Thickness = factor(Thickness, levels = c("Thin", "Moderate", "Thick")),  # put levels in physical order
    Glue = factor(Glue))
A two-predictor ANOVA model would look as follows:
\[ Force_i = \mu_{Overall} + \alpha_{Thickness_i} + \beta_{Glue_i} + \epsilon_i \]
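As a sketch, paralleling the alfalfa analysis, this additive model can be fit and its ANOVA table printed the same way:

summary(aov(Force ~ Thickness + Glue, data = GlueData))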
Here is a plot of the Force values by Thickness and Glue type, with the mean at each combination shown as a large square.
GlueData %>%
gf_point(Force ~ Thickness, color = ~Glue, alpha = 0.3) +
stat_summary(fun = mean, geom = "point", shape = 15, size = 3)
If we don’t necessarily expect the impact of one predictor to be the same across all levels of another, then we should allow for that in the model by introducing an interaction term. The model equation looks as follows:
\[ Force_i = \mu_{Overall} + \alpha_{Thickness_i} + \beta_{Glue_i} + \gamma_{{Thickness}_i,{Glue}_i} + \epsilon_i \]
where the \(\alpha\)s and \(\beta\)s have the same meaning as they do in the additive model, but the \(\gamma\)s allow for the adjustment due to Glue to itself be adjusted according to the Thickness (or vice versa).
The average adjustment due to Thin material should be \(\alpha_{Thin}\), and the average adjustment due to using Wood glue should be \(\beta_{Wood}\). Therefore the average “adjusted adjustment” (that is, the average \(\gamma\)) should be zero for each Thickness and for each Glue type. That is, if we make a table:
Thickness/Glue | Wood | Plastic |
---|---|---|
Thin |  |  |
Moderate |  |  |
Thick |  |  |
and fill in the cells with the gamma values, then each row and each column will sum to zero.
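Here is a sketch of that check (the object names are illustrative; it assumes the cell means come out with Thickness varying fastest, just as the Row-by-Acid means did above):

grand_mean <- mean(~Force, data = GlueData)
thickness_effects <- mean(Force ~ Thickness, data = GlueData) - grand_mean
glue_effects <- mean(Force ~ Glue, data = GlueData) - grand_mean
cell_means <- mean(Force ~ Thickness + Glue, data = GlueData) %>%
  matrix(nrow = 3, ncol = 2,
         dimnames = list(Thickness = levels(GlueData$Thickness),
                         Glue = levels(GlueData$Glue)))
## the gammas are what remains of each cell mean after the additive fit
gammas <- cell_means - (grand_mean + outer(thickness_effects, glue_effects, "+"))
rowSums(gammas)   # each approximately zero
colSums(gammas)   # each approximately zero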
We’ll look closer at interaction models later.