The Data

One of the first models we looked at tried to capture the relationship between the size of a house and its sale price. The HomesForSale dataset in the Lock5Data package includes list prices (not necessarily the price the house actually sells for, but the price asked in the initial listing) along with several other characteristics from 120 houses listed for sale in four different U.S. states (California, New Jersey, New York, and Pennsylvania).

Let’s load the data.
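A minimal sketch of one way to do this, assuming the Lock5Data package is already installed:

```r
# Assumes install.packages("Lock5Data") has been run beforehand.
library(Lock5Data)
data(HomesForSale)

# Quick look at the variables and the number of rows (one row per house).
str(HomesForSale)
nrow(HomesForSale)
```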

A Single Predictor Model

One of the most obvious predictors of home price is the size of the house.

Let’s first plot Price (in 1000s of US dollars) against Size (in 1000s of square feet), and fit a simple linear regression model using Size as a predictor of Price.
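Here is a sketch of how the plot and fit could be produced with base R. The model name `price_model` is chosen here for illustration, and the plotting style is an assumption; the original figure may have been drawn differently.

```r
library(Lock5Data)
data(HomesForSale)

# Simple linear regression of list price ($1000s) on size (1000s of sq ft).
price_model <- lm(Price ~ Size, data = HomesForSale)

# Scatterplot with the fitted line overlaid.
plot(Price ~ Size, data = HomesForSale,
     xlab = "Size (1000s of sq ft)", ylab = "Price (1000s of USD)")
abline(price_model)

coef(price_model)
```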

  1. Right off the bat, do you have any concerns about this model? What are they?

SOLUTION

We can tell from the plot and the line that this data does not seem to follow a linear relationship very well. Also, for larger houses, the data seems to be much farther from the line than it is for smaller houses.


Let’s look at our usual residual plots to see how well the regression conditions are satisfied.
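One way to draw these, sketched with base R's built-in diagnostics for `lm` objects. This is an assumption about tooling (the original plots may come from a different function), and `price_model` is an illustrative name.

```r
library(Lock5Data)
data(HomesForSale)
price_model <- lm(Price ~ Size, data = HomesForSale)

# Panel 1: residuals vs. fitted values (zero mean, constant variance).
# Panel 2: Normal quantile plot of the residuals (Normality).
par(mfrow = c(1, 2))
plot(price_model, which = 1:2)
par(mfrow = c(1, 1))
```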

  1. How would you characterize the violations of the regression conditions here? Cite specific information from the residual plots.

SOLUTION

Essentially all of the conditions are violated pretty badly. The residuals don’t have zero mean for all fitted values: the model underestimates the prices of the smallest houses (those with the lowest fitted values) and overestimates moderately sized houses. In addition, the variability of the residuals is larger for the houses with the highest predicted prices (i.e., the largest houses). Finally, the distribution of the residuals does not look particularly close to Normal, with some notable right skew.


Transforming the Response

In many cases, people compare prices in terms of their proportional difference, rather than the actual subtractive difference. When people get raises, when prices are marked up, when investments earn returns, etc., these are usually measured in percentage change terms rather than in raw difference terms. Between this, the right skew of the residuals, and the higher variability of residuals at higher expected prices, it might make sense to model proportional differences by log transforming the price variable.

To facilitate the use of some convenient plotting functions, let’s store the log prices as an actual column in the data.
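A short sketch of that step. The column name `Log10Price` matches the name that appears in the model output later in this document; the rest is an assumption about how the column was created.

```r
library(Lock5Data)
data(HomesForSale)

# Add the base-10 log of the list price as a new column.
HomesForSale$Log10Price <- log10(HomesForSale$Price)

head(HomesForSale[, c("State", "Price", "Size", "Log10Price")])
```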

Here’s the data with Price having been transformed via a base 10 log.

Let’s fit a model to this data and examine the residual plots.

  1. How well are the regression conditions satisfied now? Are there any remaining concerns?

SOLUTION

The zero mean and constant variance conditions are much closer to being satisfied now. The residuals still look somewhat right-skewed, though the largest residual is not nearly as much of an outlier as it was before.


The residual distribution isn’t perfect, but let’s go with it for now.

Here are the coefficients for this model, along with confidence intervals for these coefficients in the population.
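A sketch of how such estimates and intervals can be computed, assuming the log prices are stored in a column called `Log10Price`. The model name `log_model` is made up here for illustration.

```r
library(Lock5Data)
data(HomesForSale)
HomesForSale$Log10Price <- log10(HomesForSale$Price)

# Regress log10 list price on size.
log_model <- lm(Log10Price ~ Size, data = HomesForSale)

round(coef(log_model), 3)          # point estimates
confint(log_model, level = 0.95)   # 95% confidence intervals
```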

## (Intercept)        Size 
##       2.027       0.232
##                 2.5 %    97.5 %
## (Intercept) 1.9345517 2.1187639
## Size        0.1920773 0.2715171
  1. Interpret the slope, as well as the confidence interval for the slope, in context. Remember that a one unit difference on the log scale corresponds to multiplying the original value by the base of the log. How many additional square feet are needed to add a zero to the end of the predicted list price?

SOLUTION

For a house that is 1000 sq. ft. bigger, we expect its log base 10 price to be 0.232 units higher, and we are 95% confident that this difference is between 0.192 and 0.272 units. That is, we expect the actual price to be 10^0.232 = 1.71 times higher, with a 95% confidence interval from 10^0.192 = 1.56 times higher to 10^0.272 = 1.87 times higher. Adding a zero to the end of the predicted price means multiplying it by 10, which is a one unit increase in log base 10 price; at 0.232 units per 1000 sq. ft., that takes about 1/0.232 = 4.31 thousand, or roughly 4,300, additional square feet.
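The back-transformation can be checked with a few lines of arithmetic. This sketch only uses the numbers printed in the output above; the variable names are illustrative.

```r
# Slope estimate and 95% confidence limits on the log10 scale.
slope <- c(lower = 0.1920773, estimate = 0.232, upper = 0.2715171)

# Multiplicative change in predicted price per additional 1000 sq ft.
price_factor <- 10^slope
round(price_factor, 2)

# Additional size (in 1000s of sq ft) needed to multiply the predicted
# price by 10, i.e. to raise log10(price) by exactly 1.
size_for_tenfold <- 1 / slope[["estimate"]]
round(size_for_tenfold, 2)
```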


Modeling Differences Between States

Is the relationship between asking price and home size different across states? It seems likely that it would be. The variable State is a categorical variable with four levels: CA, NJ, NY and PA, corresponding to the four US states the houses in this dataset are in.

Here’s a scatterplot of the data with the houses color-coded by state.

  1. Fit two more models to this data: one where prices can generally differ by state, but where the relationship between Size and log10(Price) is not different, and one where both the general price level and the relationship between size and price can differ. Use plotModel() with each one to verify that the models look like you expect.

SOLUTION

If only the level can differ across states, then we just apply the same (proportional) price adjustment based on the state regardless of the size of the house. That model looks like this:

## (Intercept)        Size     StateNJ     StateNY     StatePA 
##       2.103       0.228      -0.060      -0.006      -0.207

If the relationship can differ as well, then the slopes of the lines relating size to (log) price may differ as well between states, which we can account for by including interaction terms.

##  (Intercept)         Size      StateNJ      StateNY      StatePA 
##        2.092        0.233        0.105       -0.054       -0.287 
## Size:StateNJ Size:StateNY Size:StatePA 
##       -0.077        0.024        0.043
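For reference, here is a sketch of how the two models might be fit. The formulas are the ones shown in the ANOVA tables later in this document; the names `parallel_model` and `interaction_model` are chosen here for illustration.

```r
library(Lock5Data)
data(HomesForSale)
HomesForSale$Log10Price <- log10(HomesForSale$Price)

# Parallel lines: State shifts the intercept only.
parallel_model <- lm(Log10Price ~ Size + State, data = HomesForSale)

# Separate lines: State shifts both the intercept and the slope.
interaction_model <- lm(Log10Price ~ Size + State + Size:State,
                        data = HomesForSale)

round(coef(parallel_model), 3)
round(coef(interaction_model), 3)
```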

  1. The first model should have five coefficients altogether: an intercept, a coefficient for Size, and coefficients for three indicator variables (called StateNJ, StateNY and StatePA in the output). What do the values of these coefficients tell us in context?

SOLUTION

The intercept and the Size coefficient tell us the intercept and slope for the reference line, which is for California.

Since there are no interaction terms in this model, the slopes of the lines are the same: for each additional 1000 sq. ft., the predicted price goes up by a factor of 10^(0.228) = 1.69. In other words, the predicted price increases by about 69%, regardless of the state.

The coefficient for StateNJ tells us how the intercept for NJ differs from the intercept for CA, but since the lines are parallel, this same gap applies at every square footage. Similarly, the other State coefficients tell us how the predicted price in that state differs from the predicted price in CA for a house of the same size. Because the prices are on a log10 scale, the coefficient of -0.060 for New Jersey tells us that, controlling for size of the house, a house in New Jersey will have a predicted price which is 10^(-0.060) = 0.871 times (87.1% of) that of a similarly sized house in California. Similarly, a house in New York will have a predicted price which is 10^(-0.006) = 0.986, or 98.6%, of that of a comparably sized house in California. Finally, a house in PA will have a predicted price which is 10^(-0.207) = 0.621, or 62.1%, of that of a similarly sized house in CA.
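These conversions can be reproduced directly from the coefficients printed for the parallel lines model. The sketch below hard-codes those rounded values, and the variable names are illustrative.

```r
# State coefficients on the log10 scale (California is the reference).
state_shift <- c(NJ = -0.060, NY = -0.006, PA = -0.207)

# Predicted price relative to a same-sized house in California.
price_ratio <- 10^state_shift
round(price_ratio, 3)

# Common multiplicative effect of an extra 1000 sq ft in this model.
size_factor <- 10^0.228
round(size_factor, 2)
```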


  1. The second model should have eight coefficients altogether: an intercept, a coefficient for Size, coefficients for three indicator variables (called StateNJ, StateNY and StatePA in the output), and coefficients for terms involving the product of each of these indicators with Size. What do the values of these coefficients tell us in context?

SOLUTION

The intercept and the Size coefficient tell us the intercept and slope for the reference line, which is for California. The coefficient for StateNJ tells us how the intercept for NJ differs from the intercept for CA. Similarly, the other State coefficients tell us how the intercept for that state differs from the intercept for CA. However, because the lines are no longer parallel, these values only correspond to price differences at Size = 0, which is not a realistic house size, so it’s not too useful to interpret their numerical values on their own. (We could center the sizes if we wanted to, so that these coefficients represented price discrepancies for some realistic reference size instead of for Size = 0.)

The interaction terms tell us how the slope in each state differs from the slope in California. For example, the coefficient of -0.077 for Size:StateNJ tells us that the increase in predicted log10 price per additional 1000 sq ft is 0.077 units lower in New Jersey than it is in California. In California, the predicted price increases by a factor of 10^(0.233) = 1.71 per additional 1000 square feet, whereas in New Jersey it only increases by a factor of 10^(0.233 - 0.077) = 1.43. That is, each additional 1000 sq ft is associated with a 43% increase in predicted price in New Jersey.

(The calculations are similar for New York and PA)
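As a sketch of those calculations, the state-specific slopes and their price multipliers can be built from the rounded interaction-model coefficients shown earlier. The variable names are illustrative.

```r
# Coefficients from the interaction model (log10 scale).
ca_slope   <- 0.233
slope_diff <- c(CA = 0, NJ = -0.077, NY = 0.024, PA = 0.043)

# State-specific slope = California slope + that state's interaction term.
state_slope <- ca_slope + slope_diff

# Multiplicative change in predicted price per additional 1000 sq ft.
state_factor <- 10^state_slope
round(state_factor, 2)
```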


  1. If we want to know whether the relationship between Size and Log10Price differs across these four states in the wider population of houses, we can use a hypothesis test comparing two nested models. What models would we compare?

SOLUTION

We would want to compare the model with the interaction terms to the model that includes Size and State as predictors but does not include the interaction. That is, we want to compare the model that fits separate lines for each state to the model that forces the lines to be parallel, capturing the notion that even though price levels differ across states, the relationship between Size and Log10Price doesn’t.


  1. Carry out the nested test for the two models in question (using anova(model1, model2) where you replace model1 and model2 with the names you gave the models being compared), and interpret the results in context.

SOLUTION
## Analysis of Variance Table
## 
## Model 1: Log10Price ~ Size + State
## Model 2: Log10Price ~ Size + State + Size:State
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    115 6.2259                           
## 2    112 5.9895  3   0.23644 1.4738 0.2255

With F = 1.47 on 3 and 112 degrees of freedom (p = 0.23), we do not have significant evidence from this data that the relationship between Size and Log10Price is different across these four states.
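To see where the F statistic comes from, it can be rebuilt by hand from the residual sums of squares and degrees of freedom in the table. This sketch uses the rounded table values, so it matches the printed F and p-value only up to rounding.

```r
# Residual sums of squares and residual degrees of freedom from the table.
rss_reduced <- 6.2259   # Log10Price ~ Size + State
df_reduced  <- 115
rss_full    <- 5.9895   # Log10Price ~ Size + State + Size:State
df_full     <- 112

# F = (drop in RSS per extra coefficient) / (residual variance of full model).
extra_df <- df_reduced - df_full
f_stat   <- ((rss_reduced - rss_full) / extra_df) / (rss_full / df_full)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 3)
```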


  1. If we want to know whether we need to account for state at all, we can also use a hypothesis test comparing two nested models. What models would we compare to answer this question?

SOLUTION

To answer this question, we’d want to compare the separate lines model (with the interaction terms) to the single-predictor model of Log10Price, which uses Size alone and does not include State at all.


  1. Carry out this second nested test and interpret the results in context.

SOLUTION
## Analysis of Variance Table
## 
## Model 1: Log10Price ~ Size
## Model 2: Log10Price ~ Size + State + Size:State
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    118 7.0607                                
## 2    112 5.9895  6    1.0712 3.3385 0.004606 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With F = 3.34 on 6 and 112 degrees of freedom (p = 0.0046), we do have significant evidence that the full interaction model better captures the relationship than a model that doesn’t account for state.


  1. Which of the three models of Log10Price would you opt for in practice based on this investigation? Why?

SOLUTION

Since we don’t have significant evidence that the relationship between Size and Log10Price differs across these four states, but we do have significant evidence that the actual prices differ after controlling for size, we might want to opt for the model that has parallel lines. We may want to check that this model is a significant improvement over the Size only model:

## Analysis of Variance Table
## 
## Model 1: Log10Price ~ Size
## Model 2: Log10Price ~ Size + State
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    118 7.0607                                
## 2    115 6.2259  3   0.83476 5.1396 0.002264 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

which it is. So the added complexity associated with adding the State variable is justified, but the additional complexity from the interaction terms may not be needed.


  1. The following command produces a 95% confidence interval based on houses in Pennsylvania with a size of 1500 square feet, using the “parallel lines model”. Interpret it in context using actual dollar amounts.
##        fit      lwr      upr
## 1 2.237703 2.152177 2.323229

SOLUTION

For houses in PA which are around 1500 sq. ft., we are 95% confident that the mean log price is between 2.152 and 2.323. That is, we are 95% confident that the geometric mean of the prices in this population of houses is between 10^2.152 * 1000 = 141,906 and 10^2.323 * 1000 = 210,378 dollars.
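The command itself is not reproduced here. One way such an interval could be obtained is sketched below, under the assumption that the parallel lines model was fit with `lm()`; the names `parallel_model` and `new_house` are made up for illustration.

```r
library(Lock5Data)
data(HomesForSale)
HomesForSale$Log10Price <- log10(HomesForSale$Price)
parallel_model <- lm(Log10Price ~ Size + State, data = HomesForSale)

# A 1500 sq ft house in Pennsylvania (Size is recorded in 1000s of sq ft).
new_house <- data.frame(Size = 1.5, State = "PA")

# 95% confidence interval for the mean log10 price of such houses.
ci_log <- predict(parallel_model, newdata = new_house,
                  interval = "confidence", level = 0.95)
ci_log

# Back-transform to dollars (Price is recorded in $1000s).
round(1000 * 10^ci_log)
```

Swapping `interval = "confidence"` for `interval = "prediction"` gives an interval for a single house instead of for the mean.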


  1. The following command produces a 95% prediction interval based on a house in Pennsylvania with a size of 1500 square feet, using the “parallel lines model”. Interpret it in context using actual dollar amounts, making sure to contrast the interpretation with the confidence interval.
##        fit      lwr      upr
## 1 2.237703 1.768946 2.706459

SOLUTION

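If we have a given house in PA which is around 1500 sq. ft., we can be 95% confident that its log base 10 list price is between 1.769 and 2.706; that is, that its list price is between 10^1.769 * 1000 = 58,749 and 10^2.706 * 1000 = 508,159 dollars. This interval is much wider than the confidence interval because it has to cover the price of one individual house, which varies around the mean, rather than just the (geometric) mean price of all such houses.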
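To make the contrast concrete, here is a small sketch that converts both intervals to dollars and compares how many-fold each one spans. It only uses the endpoints printed in the two outputs above; the helper `to_dollars` is an illustrative name.

```r
# Interval endpoints on the log10 scale, copied from the printed output.
conf_int <- c(lwr = 2.152177, upr = 2.323229)  # mean price of such houses
pred_int <- c(lwr = 1.768946, upr = 2.706459)  # price of one such house

# Undo the log, then undo the "$1000s" units.
to_dollars <- function(log10_price) 1000 * 10^log10_price

round(to_dollars(conf_int))
round(to_dollars(pred_int))

# Ratio of upper to lower endpoint for each interval.
conf_fold <- 10^diff(conf_int)
pred_fold <- 10^diff(pred_int)
round(c(confidence = unname(conf_fold), prediction = unname(pred_fold)), 2)
```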