We will continue looking at the data on the dietary behavior and metabolic functioning of lab mice under different environmental light cycle conditions.
The data is in `LightatNight4Weeks` in the `Lock5Data` package:
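The code chunk that loaded the data isn't shown here; assuming the `Lock5Data` package is installed, it was presumably something like:

```r
library(Lock5Data)        # provides the LightatNight4Weeks data frame
library(dplyr)            # for the %>% pipelines used below
data(LightatNight4Weeks)
```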
Our first response variable is `Consumption`, which records the amount of food in grams consumed by each mouse per day. Our explanatory variable is the factor `Light`, which has three levels:

- `LD`: the control condition, with light during the day and no light at night
- `DM`: dim light at night (comparable to a TV set in the room), and normal light during the day
- `LL`: normal daytime light during both day and night

A means model can be written as:
\[ Consumption_i = \mu_{Light_i} + \epsilon_i \]
Using effect coding, we can separate \(\mu_{Light_i}\) into \(\mu_{Overall}\), the overall mean food consumption averaged over all three groups, plus a “perturbation” term \(\alpha_{Light_i}\), which represents the difference between the mean food consumption for the lighting condition that mouse \(i\) is in and the overall mean.
This gives the model equation:
\[ Consumption_i = \mu_{Overall} + \alpha_{Light_i} + \epsilon_i \]
As in linear regression, our inference methods will be based on having residuals that are approximately Normally distributed, with zero mean, and the same variance for all groups.
We can estimate \(\mu_{Overall}\) using the overall mean in the data, and we can estimate the \(\mu_{Light_i}\)s using the group means.
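The code chunk that produced the output below isn't shown; a sketch in base R (the name `mu_hat_overall` reappears in a later chunk, so presumably it was defined here):

```r
# Overall mean consumption, and the mean consumption within each lighting group
mu_hat_overall <- mean(LightatNight4Weeks$Consumption)
mu_hat_overall
tapply(LightatNight4Weeks$Consumption, LightatNight4Weeks$Light, mean)
```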
## [1] 4.31463
## DM LD LL
## 4.124100 4.327500 4.514889
In order to quantify how much of the variability in food consumption is accounted for by the lighting conditions, and to carry out a hypothesis test of the null hypothesis that lighting conditions have no impact on mean food consumption, we need to partition the variability in `Consumption` into two components: \(SS_{Model}\), which is the component accounted for by the predictor, and \(SS_{Residuals}\), which is the component not accounted for by the predictor.
Subtracting \(\mu_{Overall}\) from both sides of the model equation, and writing \(Y_i\) for the response and \(X_i\) for the lighting group of mouse \(i\), we have:
\[ (Y_i - \mu_{Overall}) = \alpha_{X_i} + \epsilon_i \]
Squaring both sides gives:
\[ (Y_i - \mu_{Overall})^2 = (\alpha_{X_i} + \epsilon_i)^2 = \alpha_{X_i}^2 + 2 \alpha_{X_i} \epsilon_i + \epsilon_i^2\]
and summing over all \(N\) observations gives:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{i=1}^N \alpha_{X_i}^2 + 2 \sum_{i=1}^N \alpha_{X_i} \epsilon_i + \sum_{i=1}^N \epsilon_i^2 \]
One way to sum terms over all \(N\) observations is to sum the terms separately for each group, and then add the component sums. If we define \(N_k\) as the number of mice in lighting group \(k\), and we denote the consumption score for the \(j\)th mouse in group \(k\) as \(Y_{k,j}\), then we can rewrite the term
\[ \sum_{i=1}^N \alpha_{X_i}^2 \] as a nested sum:
\[ \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_k^2 \] where \(K\) is the number of groups (\(K = 3\) for this data). The way to read this is: for each \(k\), find the value of the term inside for each \(j\) going from 1 to \(N_k\), and sum them. Then, sum these sums across all of the values of \(k\).
We can do a similar manipulation of the cross term \(\sum_{i=1}^N \alpha_{X_i} \epsilon_i\), writing the residual for the \(j\)th mouse in group \(k\) as \(\epsilon_{k,j}\), and summing first within each group and then summing the partial sums:
\[ \sum_{i=1}^N \alpha_{X_i} \epsilon_i = \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_k \epsilon_{k,j} \]
Since \(\alpha_k\) isn’t changing with \(j\), we can factor it out of the inner sum, to get
\[ \sum_{k=1}^K \alpha_k \sum_{j=1}^{N_k} \epsilon_{k,j} \]
Since each residual \(\epsilon_{k,j}\) is the difference between a mouse's `Consumption` value and the mean for its group, what is the average value of the \(\epsilon\)s within each group? Therefore what is the sum of the \(\epsilon\)s within each group? What can we say about this entire term in the decomposition of variability?

We are therefore able to simplify our decomposition of variability to:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_{k}^2 + \sum_{i=1}^N \epsilon_i^2 \]
In other words:
\[ SS_{Total} = SS_{Model} + SS_{Residuals} \]
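To see this identity in action numerically, here is a small toy check (hypothetical numbers, not the mouse data) that the two components add up to the total:

```r
# Toy illustration that SS_Total = SS_Model + SS_Residuals
y <- c(4.1, 4.3, 3.9, 4.6, 4.8, 4.4)             # hypothetical consumption values
g <- factor(c("DM", "DM", "LL", "LL", "LD", "LD"))  # hypothetical groups

mu_overall  <- mean(y)
group_means <- tapply(y, g, mean)
alpha <- group_means[g] - mu_overall   # effect term for each observation
eps   <- y - group_means[g]            # residual for each observation

SS_total <- sum((y - mu_overall)^2)
SS_model <- sum(alpha^2)
SS_resid <- sum(eps^2)
all.equal(SS_total, SS_model + SS_resid)  # TRUE
```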
We found \(SS_{Model}\) for the mouse data in Exercise 3. We can find \(SS_{Total}\) by first finding \(Y_i - \mu_{Overall}\) for each mouse, and then squaring the results and summing them.
LightatNight4Weeks <- LightatNight4Weeks %>%
mutate(
DeviationScore = Consumption - mu_hat_overall,
SquaredDeviationScore = DeviationScore^2
)
LightatNight4Weeks %>% select(DeviationScore, SquaredDeviationScore)
## DeviationScore SquaredDeviationScore
## 1 -0.92762963 0.860496730
## 2 -0.90062963 0.811133730
## 3 -0.52362963 0.274187989
## 4 -0.86362963 0.745856137
## 5 -1.09562963 1.200404285
## 6 -0.81362963 0.661993174
## 7 -0.70162963 0.492284137
## 8 -0.39162963 0.153373767
## 9 -0.07462963 0.005569582
## 10 -0.45762963 0.209424878
## 11 0.17437037 0.030405026
## 12 -0.72762963 0.529444878
## 13 -0.15362963 0.023602063
## 14 -0.83562963 0.698276878
## 15 0.19937037 0.039748545
## 16 0.10137037 0.010275952
## 17 1.62537037 2.641828841
## 18 -0.08362963 0.006993915
## 19 0.27137037 0.073641878
## 20 1.00937037 1.018828545
## 21 0.57537037 0.331051063
## 22 0.55837037 0.311777471
## 23 0.31837037 0.101359693
## 24 0.63137037 0.398628545
## 25 2.86237037 8.193164137
## 26 0.53437037 0.285551693
## 27 -0.31062963 0.096490767
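The chunk that summed these squared deviations to get \(SS_{Total}\) isn't shown; presumably it was something like:

```r
# SS_Total: sum of the squared deviations from the overall mean
sum(LightatNight4Weeks$SquaredDeviationScore)
```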
## [1] 20.20579
The degrees of freedom associated with each component of variation quantify the number of nonredundant pieces of information that went into the calculation (assuming we know the number of observations in each group in advance).
In the case of \(SS_{Residuals}\), recall that the sum of the residuals within each group has to be zero, because of the way the residuals are defined (in terms of the group mean).
Consider the control (`LD`) condition. How many residuals are there in total for that group? How many of these do you actually need to know to be able to infer the rest?

To understand how many nonredundant pieces of information go into \(SS_{Model}\), it is instructive to look at a weighted sum of the \(\alpha\)s, where the weights are the number of observations in each group.
\[ N_{LD} \alpha_{LD} + N_{DM} \alpha_{DM} + N_{LL} \alpha_{LL} \]
Replacing each \(\alpha\) with its definition in terms of group means and the overall mean, this becomes
\[ N_{LD} (\mu_{LD} - \mu_{Overall}) + N_{DM} (\mu_{DM} - \mu_{Overall}) + N_{LL} (\mu_{LL} - \mu_{Overall}) \]
Distributing and grouping gives us:
\[ (N_{LD} \mu_{LD} + N_{DM} \mu_{DM} + N_{LL} \mu_{LL}) - N \mu_{Overall} \]
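Since the first term is the weighted average of the group means scaled up by \(N\), which is just \(N \mu_{Overall}\), the whole expression works out to zero. A quick numerical check (again with hypothetical numbers, not the mouse data):

```r
# The observation-weighted alphas sum to zero, so knowing all but one
# of the alphas determines the last one
y <- c(4.1, 4.3, 3.9, 4.6, 4.8)          # hypothetical values
g <- factor(c("DM", "DM", "DM", "LL", "LL"))

Nk    <- table(g)                         # group sizes
alpha <- tapply(y, g, mean) - mean(y)     # group effects
sum(Nk * alpha)                           # zero, up to floating-point error
```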
Check your answers for \(SS\) and \(df\) by running the following code (the column labeled `Light` is the variability component for the model, since that's our only predictor):
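The code chunk itself isn't shown here; presumably it was something like:

```r
# ANOVA table for the group-means model; the "Light" row is SS_Model
anova(lm(Consumption ~ Light, data = LightatNight4Weeks))
```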
Because the predicted value for each mouse is the mean of its group, the residuals are guaranteed to have zero mean within each group. However, we should still check for approximate Normality and approximately constant variance.
To check Normality, we can use a side-by-side boxplot to look at distributions within each group, and a QQ plot to look at the overall distribution as usual:
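The plotting chunk isn't shown; a sketch (assuming `ggplot2` is available) might be:

```r
library(ggplot2)

# Side-by-side boxplots of consumption within each lighting group
ggplot(LightatNight4Weeks, aes(x = Light, y = Consumption)) +
  geom_boxplot()

# Normal QQ plot of the residuals from the group-means model
mod <- lm(Consumption ~ Light, data = LightatNight4Weeks)
qqnorm(residuals(mod))
qqline(residuals(mod))
```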
To check for constant variance, we can simply calculate the standard deviations of the observations in each group. As a rough guideline, we’d like the ratio between the largest and smallest standard deviation not to exceed 2.
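The group standard deviations below were presumably computed with something like:

```r
# Standard deviation of Consumption within each lighting group
tapply(LightatNight4Weeks$Consumption, LightatNight4Weeks$Light, sd)
```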
## DM LD LL
## 0.6937946 0.4337033 1.3148988
The conditions are potentially violated here, due to the possible outlier in the `LL` group, which both distorts the Normality of the residuals in that group and inflates their standard deviation. We might consider fitting the model with and without that mouse to see how much the results change.
But for now, for the sake of practice with the mechanics, let’s carry on with our hypothesis test, as we now have all the ingredients we need to fill out the ANOVA table.
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | | | | |
| Residuals | | | | N/A |
| Total | | | N/A | N/A |
Having found the \(F\) statistic, we can obtain the \(P\)-value associated with the null hypothesis that all of the \(\alpha\)s in the “long run of infinite data” would be zero, by finding the chance of getting a value that large or larger in an \(F\) distribution with the corresponding degrees of freedom.
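As a sketch of this final step: with \(K = 3\) groups and \(N = 27\) mice, the degrees of freedom are \(K - 1 = 2\) and \(N - K = 24\). The \(F\) statistic below is a hypothetical placeholder; substitute the value from your completed ANOVA table:

```r
# P-value: upper-tail probability from the F distribution
f_stat   <- 1.5    # hypothetical placeholder for the observed F statistic
df_model <- 2      # K - 1 = 3 - 1
df_resid <- 24     # N - K = 27 - 3
p_value  <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value
```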