We will continue looking at the data on the dietary behavior and metabolic functioning of lab mice under different environmental light cycle conditions.
The data is in `LightatNight4Weeks` in the `Lock5Data` package:
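The code chunk that loaded the data isn't shown here; assuming the `Lock5Data` package is installed, it was presumably something like:

```r
library(Lock5Data)        # provides the LightatNight4Weeks data frame
library(dplyr)            # for the %>% pipelines used below
data(LightatNight4Weeks)
```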
Our first response variable is `Consumption`, which records the amount of food in grams consumed by each mouse per day. Our explanatory variable is the factor `Light`, which has three levels:

- `LD`: the control condition, with light during the day and no light at night
- `DM`: dim light at night (comparable to a TV set in the room), and normal light during the day
- `LL`: normal daytime light during both day and night

A means model can be written as:
\[ Consumption_i = \mu_{Light_i} + \epsilon_i \]
Using effect coding, we can separate \(\mu_{Light_i}\) into \(\mu_{Overall}\), the overall mean food consumption averaged over all three groups, plus a “perturbation” term \(\alpha_{Light_i}\), which represents the difference between the mean food consumption for the lighting condition that mouse \(i\) is in and the overall mean.
This gives the model equation:
\[ Consumption_i = \mu_{Overall} + \alpha_{Light_i} + \epsilon_i \]
As in linear regression, our inference methods will be based on having residuals that are approximately Normally distributed, with zero mean, and the same variance for all groups.
We can estimate \(\mu_{Overall}\) using the overall mean in the data, and we can estimate the \(\mu_{Light_i}\)s using the group means.
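The code chunk that produced the output below isn't shown; a sketch in base R (the name `mu_hat_overall` reappears in a later chunk, so presumably it was defined here):

```r
# Overall mean consumption, and the mean consumption within each lighting group
mu_hat_overall <- mean(LightatNight4Weeks$Consumption)
mu_hat_overall
tapply(LightatNight4Weeks$Consumption, LightatNight4Weeks$Light, mean)
```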
## [1] 4.31463
## DM LD LL
## 4.124100 4.327500 4.514889
In order to quantify how much of the variability in food consumption is accounted for by the lighting conditions, and to carry out a hypothesis test of the null hypothesis that lighting conditions have no impact on mean food consumption, we need to partition the variability in `Consumption` into two components: \(SS_{Model}\), which is the component accounted for by the predictor, and \(SS_{Residuals}\), which is the component not accounted for by the predictor.
Subtracting \(\mu_{Overall}\) from both sides of the model equation, and writing \(Y_i\) for the response and \(X_i\) for the lighting group of mouse \(i\), we have:
\[ (Y_i - \mu_{Overall}) = \alpha_{X_i} + \epsilon_i \]
Squaring both sides gives:
\[ (Y_i - \mu_{Overall})^2 = (\alpha_{X_i} + \epsilon_i)^2 = \alpha_{X_i}^2 + 2 \alpha_{X_i} \epsilon_i + \epsilon_i^2\]
and summing over all \(N\) observations gives:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{i=1}^N \alpha_{X_i}^2 + 2 \sum_{i=1}^N \alpha_{X_i} \epsilon_i + \sum_{i=1}^N \epsilon_i^2 \]
One way to sum terms over all \(N\) observations is to sum the terms separately for each group, and then add the component sums. If we define \(N_k\) as the number of mice in lighting group \(k\), and we denote the consumption score for the \(j\)th mouse in group \(k\) as \(Y_{k,j}\), then we can rewrite the term
\[ \sum_{i=1}^N \alpha_{X_i}^2 \] as a nested sum:
\[ \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_k^2 \] where \(K\) is the number of groups (\(K = 3\) for this data). The way to read this is: for each \(k\), find the value of the term inside for each \(j\) going from 1 to \(N_k\), and sum them. Then, sum these sums across all of the values of \(k\).
We can do a similar manipulation of the cross term \(\sum_{i=1}^N \alpha_{X_i} \epsilon_i\), writing the residual for the \(j\)th mouse in group \(k\) as \(\epsilon_{k,j}\), and summing first within each group and then summing the partial sums:
\[ \sum_{i=1}^N \alpha_{X_i} \epsilon_i = \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_k \epsilon_{k,j} \]
Since \(\alpha_k\) isn’t changing with \(j\), we can factor it out of the inner sum, to get
\[ \sum_{k=1}^K \alpha_k \sum_{j=1}^{N_k} \epsilon_{k,j} \]
Since each residual \(\epsilon_{k,j}\) is the difference between a mouse's `Consumption` value and the mean for its group, what is the average value of the \(\epsilon\)s within each group? Therefore what is the sum of the \(\epsilon\)s within each group? What can we say about this entire term in the decomposition of variability?

We are therefore able to simplify our decomposition of variability to:
\[ \sum_{i=1}^N (Y_i - \mu_{Overall})^2 = \sum_{k=1}^K \sum_{j=1}^{N_k} \alpha_{k}^2 + \sum_{i=1}^N \epsilon_i^2 \]
In other words:
\[ SS_{Total} = SS_{Model} + SS_{Residuals} \]
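To see this identity in action numerically, here is a small toy check (hypothetical numbers, not the mouse data) that the two components add up to the total:

```r
# Toy illustration that SS_Total = SS_Model + SS_Residuals
y <- c(4.1, 4.3, 3.9, 4.6, 4.8, 4.4)             # hypothetical consumption values
g <- factor(c("DM", "DM", "LL", "LL", "LD", "LD"))  # hypothetical groups

mu_overall  <- mean(y)
group_means <- tapply(y, g, mean)
alpha <- group_means[g] - mu_overall   # effect term for each observation
eps   <- y - group_means[g]            # residual for each observation

SS_total <- sum((y - mu_overall)^2)
SS_model <- sum(alpha^2)
SS_resid <- sum(eps^2)
all.equal(SS_total, SS_model + SS_resid)  # TRUE
```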
We found \(SS_{Model}\) for the mouse data in Exercise 3. We can find \(SS_{Total}\) by first finding \(Y_i - \mu_{Overall}\) for each mouse, and then squaring the results and summing them.
LightatNight4Weeks <- LightatNight4Weeks %>%
mutate(
DeviationScore = Consumption - mu_hat_overall,
SquaredDeviationScore = DeviationScore^2
)
LightatNight4Weeks %>% select(DeviationScore, SquaredDeviationScore)
## DeviationScore SquaredDeviationScore
## 1 -0.92762963 0.860496730
## 2 -0.90062963 0.811133730
## 3 -0.52362963 0.274187989
## 4 -0.86362963 0.745856137
## 5 -1.09562963 1.200404285
## 6 -0.81362963 0.661993174
## 7 -0.70162963 0.492284137
## 8 -0.39162963 0.153373767
## 9 -0.07462963 0.005569582
## 10 -0.45762963 0.209424878
## 11 0.17437037 0.030405026
## 12 -0.72762963 0.529444878
## 13 -0.15362963 0.023602063
## 14 -0.83562963 0.698276878
## 15 0.19937037 0.039748545
## 16 0.10137037 0.010275952
## 17 1.62537037 2.641828841
## 18 -0.08362963 0.006993915
## 19 0.27137037 0.073641878
## 20 1.00937037 1.018828545
## 21 0.57537037 0.331051063
## 22 0.55837037 0.311777471
## 23 0.31837037 0.101359693
## 24 0.63137037 0.398628545
## 25 2.86237037 8.193164137
## 26 0.53437037 0.285551693
## 27 -0.31062963 0.096490767
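The chunk that summed these squared deviations to get \(SS_{Total}\) isn't shown; presumably it was something like:

```r
# SS_Total: sum of the squared deviations from the overall mean
sum(LightatNight4Weeks$SquaredDeviationScore)
```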
## [1] 20.20579
The degrees of freedom associated with each component of variation quantify the number of nonredundant pieces of information that went into the calculation (assuming we know the number of observations in each group in advance).
In the case of \(SS_{Residuals}\), recall that the sum of the residuals within each group has to be zero, because of the way the residuals are defined (in terms of the group mean).
Consider the control (`LD`) condition. How many residuals are there in total for that group? How many of these do you actually need to know to be able to infer the rest?

To understand how many nonredundant pieces of information go into \(SS_{Model}\), it is instructive to look at a weighted sum of the \(\alpha\)s, where the weights are the number of observations in each group.
\[ N_{LD} \alpha_{LD} + N_{DM} \alpha_{DM} + N_{LL} \alpha_{LL} \]
Replacing each \(\alpha\) with its definition in terms of group means and the overall mean, this becomes
\[ N_{LD} (\mu_{LD} - \mu_{Overall}) + N_{DM} (\mu_{DM} - \mu_{Overall}) + N_{LL} (\mu_{LL} - \mu_{Overall}) \]
Distributing and grouping gives us:
\[ (N_{LD} \mu_{LD} + N_{DM} \mu_{DM} + N_{LL} \mu_{LL}) - N \mu_{Overall} \]
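Since the first term is the weighted average of the group means scaled up by \(N\), which is just \(N \mu_{Overall}\), the whole expression works out to zero. A quick numerical check (again with hypothetical numbers, not the mouse data):

```r
# The observation-weighted alphas sum to zero, so knowing all but one
# of the alphas determines the last one
y <- c(4.1, 4.3, 3.9, 4.6, 4.8)          # hypothetical values
g <- factor(c("DM", "DM", "DM", "LL", "LL"))

Nk    <- table(g)                         # group sizes
alpha <- tapply(y, g, mean) - mean(y)     # group effects
sum(Nk * alpha)                           # zero, up to floating-point error
```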
Check your answers for \(SS\) and \(df\) by running the following code (the column labeled `Light` is the variability component for the model, since that's our only predictor):
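The code chunk itself isn't shown here; presumably it was something like:

```r
# ANOVA table for the group-means model; the "Light" row is SS_Model
anova(lm(Consumption ~ Light, data = LightatNight4Weeks))
```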
Because the predicted value for each mouse is the mean of its group, the residuals are guaranteed to have zero mean within each group. However, we should still check for approximate Normality and approximately constant variance.
To check Normality, we can use a side-by-side boxplot to look at distributions within each group, and a QQ plot to look at the overall distribution as usual:
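The plotting chunk isn't shown; a sketch (assuming `ggplot2` is available) might be:

```r
library(ggplot2)

# Side-by-side boxplots of consumption within each lighting group
ggplot(LightatNight4Weeks, aes(x = Light, y = Consumption)) +
  geom_boxplot()

# Normal QQ plot of the residuals from the group-means model
mod <- lm(Consumption ~ Light, data = LightatNight4Weeks)
qqnorm(residuals(mod))
qqline(residuals(mod))
```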
To check for constant variance, we can simply calculate the standard deviations of the observations in each group. As a rough guideline, we’d like the ratio between the largest and smallest standard deviation not to exceed 2.
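The group standard deviations below were presumably computed with something like:

```r
# Standard deviation of Consumption within each lighting group
tapply(LightatNight4Weeks$Consumption, LightatNight4Weeks$Light, sd)
```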
## DM LD LL
## 0.6937946 0.4337033 1.3148988
The conditions are potentially violated here, due to the possible outlier in the `LL` group, which both distorts the Normality of the residuals in that group and inflates their standard deviation. We might consider fitting the model with and without that mouse to see how much the results change.
But for now, for the sake of practice with the mechanics, let’s carry on with our hypothesis test, as we now have all the ingredients we need to fill out the ANOVA table.
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Model | | | | |
| Residuals | | | | N/A |
| Total | | | N/A | N/A |
Having found the \(F\) statistic, we can obtain the \(P\)-value associated with the null hypothesis that all of the \(\alpha\)s in the “long run of infinite data” would be zero, by finding the chance of getting a value that large or larger in an \(F\) distribution with the corresponding degrees of freedom.
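As a sketch of this final step: with \(K = 3\) groups and \(N = 27\) mice, the degrees of freedom are \(K - 1 = 2\) and \(N - K = 24\). The \(F\) statistic below is a hypothetical placeholder; substitute the value from your completed ANOVA table:

```r
# P-value: upper-tail probability from the F distribution
f_stat   <- 1.5    # hypothetical placeholder for the observed F statistic
df_model <- 2      # K - 1 = 3 - 1
df_resid <- 24     # N - K = 27 - 3
p_value  <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value
```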