
## A multi-regression exercise in R

This post shows a multiple regression example in R that analyzes the miles per gallon (MPG) of 32 automobiles to decide quantitatively whether an automatic or a manual transmission is better for MPG. Among 10 automobile design aspects, the final statistical model uses weight, V/S, transmission type, and number of carburetors. The model has an adjusted $$R^{2}$$ of 0.81 and a p-value < 0.01. In this model, transmission type does not appear to have a strong relationship with MPG, probably because it is strongly correlated with car weight, which has a much more significant effect on MPG.

(The R source code is available; see note 1.)

### Exploratory data analysis

The data set used in this exercise is the so-called “Motor Trend Car Road Tests” (`mtcars` in R). It is based on the 1974 Motor Trend US magazine and comprises fuel consumption, expressed as miles per gallon (MPG), of 32 automobile models released between 1973 and 1974. The data also includes 10 aspects of automobile design: number of cylinders, displacement (cu.in.), gross horsepower, rear axle ratio, weight, ¼ mile time, V/S, transmission, number of forward gears, and number of carburetors.

The MPG values of the 32 cars range from 10.4 to 33.9. The mean is 20.09, and the standard deviation is 6.03. The distribution of MPG is positively skewed (skewness 0.61).
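These summary statistics can be recomputed from the built-in `mtcars` data. The snippet below is a minimal sketch; the skewness definition (third central moment divided by the cubed sample standard deviation) is my assumption, since other definitions give slightly different values.

```r
# Summary statistics of MPG from the built-in mtcars data set.
data(mtcars)
mpg <- mtcars$mpg

mpg_range <- range(mpg)  # smallest and largest MPG
mpg_mean  <- mean(mpg)
mpg_sd    <- sd(mpg)     # sample standard deviation (n - 1 denominator)

# Skewness as the third central moment over sd^3 (assumed definition).
mpg_skew <- mean((mpg - mpg_mean)^3) / mpg_sd^3

print(round(c(min = mpg_range[1], max = mpg_range[2],
              mean = mpg_mean, sd = mpg_sd, skewness = mpg_skew), 2))
```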

The goal of this exercise is to decide quantitatively whether an automatic transmission (AT) or a manual transmission (MT) is better for MPG. The first step of the analysis is to compare the MPG of the AT and MT groups. Unless a statistically significant difference is found here, the analysis using this data would be inconclusive. The p-value of the t-test rounds to 0 (p < 0.01), and the 95% confidence interval for the difference in means (AT minus MT) is (-11.28, -3.21). Figure 1 shows the boxplot of the two groups. The t-test shows that MT cars have significantly higher MPG than AT cars, so further analysis is conducted in the rest of the report.

Figure 1. MPG difference between automatic and manual
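The group comparison can be sketched with base R as follows. I assume the default Welch two-sample t-test on the `am` column of `mtcars` (0 = automatic, 1 = manual); the object name `tt_mpg` is only for illustration.

```r
# Two-sample t-test of MPG by transmission type.
data(mtcars)

# With the formula interface the difference is group 0 minus group 1,
# i.e. automatic minus manual.
tt_mpg <- t.test(mpg ~ am, data = mtcars)

print(tt_mpg$p.value)             # p-value of the test
print(round(tt_mpg$conf.int, 2))  # 95% confidence interval
print(tt_mpg$estimate)            # group means
```

A negative interval that excludes zero means the automatic group has the lower mean MPG.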

### Models and strategy for the selection

In the scatter plot matrix of MPG and all the design aspects (Fig. 2), some design aspects show higher correlations with MPG than others. Correlations among some of the design aspects themselves are also visible. The model must be selected so that it accounts for the variance of the coefficients and correctly controls for the effects of collinearity. In this section, 3 preliminary models are created. Model 1 is a reference model with only transmission type included as the independent variable. Model 2 is a model with all the design aspects included. Model 3 is a model with a reduced set of regressors obtained via step-wise analysis of variance inflation factors (VIF). The final model is created by taking automobile design rationale into account instead of blindly adopting a statistical algorithm.

Figure 2. Scatter plot matrix
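A numeric companion to the scatter plot matrix is the correlation matrix. The sketch below uses Pearson correlations via `cor()`; the variable names `cor_mpg` and `cor_wt_am` are mine.

```r
# Pairwise Pearson correlations among MPG and the design aspects.
data(mtcars)
cor_all <- cor(mtcars)

# Correlation of each design aspect with MPG, strongest first.
cor_mpg <- cor_all["mpg", colnames(cor_all) != "mpg"]
print(round(cor_mpg[order(-abs(cor_mpg))], 2))

# Correlation between weight and transmission type, two regressors
# that are themselves related.
cor_wt_am <- cor_all["wt", "am"]
print(round(cor_wt_am, 2))
```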

#### Model 1

As the first preliminary model, I included only transmission type as the independent variable. Running the linear regression yields an $$R^{2}$$ of 0.36 and a p-value below 0.001:

`sum.lm.unadjusted$coeff`

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 17.1 | 1.1 | 15.2 | 1.1e-15 |
| am | 7.2 | 1.8 | 4.1 | 0.000285 |

So, while the p-value is small, we could construct a model with a higher $$R^{2}$$.

#### Model 2

Another preliminary model can be constructed by including all 10 independent variables. In this model, the adjusted $$R^{2}$$ is 0.81, but this is probably the result of having too many independent variables. None of the p-values is less than 0.05:

`sum.lm.all.in$coeff`

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 12.303 | 18.718 | 0.66 | 0.518 |
| cyl | -0.111 | 1.045 | -0.11 | 0.916 |
| disp | 0.013 | 0.018 | 0.75 | 0.463 |
| hp | -0.021 | 0.022 | -0.99 | 0.335 |
| drat | 0.787 | 1.635 | 0.48 | 0.635 |
| wt | -3.715 | 1.894 | -1.96 | 0.063 |
| qsec | 0.821 | 0.731 | 1.12 | 0.274 |
| vs | 0.318 | 2.105 | 0.15 | 0.881 |
| am | 2.520 | 2.057 | 1.23 | 0.234 |
| gear | 0.655 | 1.493 | 0.44 | 0.665 |
| carb | -0.199 | 0.829 | -0.24 | 0.812 |
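Models 1 and 2 can be fitted along these lines. The summary object names follow the ones printed in this post; the `lm()` calls themselves are my reconstruction and assume `am` is kept as the numeric 0/1 column of `mtcars`.

```r
# Model 1: transmission type only. Model 2: all 10 design aspects.
data(mtcars)

lm.unadjusted     <- lm(mpg ~ am, data = mtcars)
sum.lm.unadjusted <- summary(lm.unadjusted)

lm.all.in     <- lm(mpg ~ ., data = mtcars)  # "." means every other column
sum.lm.all.in <- summary(lm.all.in)

print(round(sum.lm.unadjusted$coefficients, 4))
print(round(sum.lm.unadjusted$r.squared, 2))

print(round(sum.lm.all.in$coefficients, 3))
print(round(sum.lm.all.in$adj.r.squared, 2))
```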

#### Model 3

I first focus on finding the collinearity among the independent variables, which has the effect of inflating variance. One measure we can use is the variance inflation factor (VIF; see note 2). The algorithm is:

1. Start with all the variables. Set the current VIF threshold to 5.
2. Build the regression model with the current set of independent variables.
3. If the p-values of all the independent variables are smaller than 0.05, finish selecting variables.
4. Calculate VIF.
5. Remove all the independent variables that have a VIF higher than the current threshold.
6. Lower the VIF threshold and go to Step 2.
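The steps above can be sketched in base R as follows. This is not the code that produced Model 3: the function names `vif_values` and `vif_stepwise`, the step size by which the threshold is lowered, and the extra stopping rules (stop when the threshold reaches 1 or when every variable would be removed) are my assumptions. VIFs are computed as the diagonal of the inverse of the predictor correlation matrix, which equals $$1/(1-R_j^2)$$ for each predictor, so no extra package is needed.

```r
# VIF of each predictor: diagonal of the inverse correlation matrix.
vif_values <- function(data, vars) {
  if (length(vars) < 2) return(setNames(rep(1, length(vars)), vars))
  setNames(diag(solve(cor(data[, vars, drop = FALSE]))), vars)
}

# Step-wise selection following steps 1 to 6 (hypothetical helper).
vif_stepwise <- function(data, response, threshold = 5,
                         step = 0.5, alpha = 0.05) {
  vars <- setdiff(names(data), response)           # step 1
  repeat {
    fit <- lm(reformulate(vars, response = response), data = data)  # step 2
    p <- summary(fit)$coefficients[-1, "Pr(>|t|)"] # drop the intercept
    if (all(p < alpha)) break                      # step 3
    vif  <- vif_values(data, vars)                 # step 4
    keep <- vars[vif <= threshold]                 # step 5
    # Assumed safeguards so that the loop always terminates.
    if (length(keep) == 0 || threshold <= 1) break
    vars <- keep
    threshold <- threshold - step                  # step 6
  }
  list(variables = vars, fit = fit)
}

data(mtcars)
res <- vif_stepwise(mtcars, "mpg")
print(res$variables)
```

Because the amount by which the threshold is lowered is not fixed by the algorithm, the selected variables depend on `step` and need not match the regressors reported for Model 3.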

With this algorithm, the design aspects of rear axle ratio (drat), V/S (vs), transmission type (am), and number of carburetors (carb) remained as regressors. Three of the 4 regressors have p-values smaller than 0.05:

`sum.lm.vif$coeff`

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 11.3 | 5.0 | 2.3 | 0.03199 |
| drat | 2.7 | 1.6 | 1.7 | 0.09635 |
| vs | 3.1 | 1.4 | 2.2 | 0.03884 |
| am | 4.9 | 1.5 | 3.3 | 0.00291 |
| carb | -1.5 | 0.4 | -3.8 | 0.00074 |

However, the model has an adjusted $$R^{2}$$ of less than 0.80 (0.77). This could be due to the limitation of using only one statistical algorithm in creating the model.

#### Final model

Instead of blindly adopting a statistical algorithm, one should also consult domain knowledge when deciding which regressors to remove or add. Physically speaking, it is not plausible to remove weight from the set of regressors when considering MPG. In Model 3, rear axle ratio is the only regressor whose p-value is not below 0.05. The final model is constructed by replacing rear axle ratio with weight. This resulted in a high adjusted $$R^{2}$$ (0.81). All the p-values are less than 0.05 except for transmission type:

`sum.lm.final$coeff`

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 29.31 | 3.53 | 8.3 | 0.0000000066 |
| wt | -2.85 | 0.94 | -3.0 | 0.0054056214 |
| vs | 2.74 | 1.26 | 2.2 | 0.0385617194 |
| am | 3.07 | 1.57 | 2.0 | 0.0606601842 |
| carb | -0.88 | 0.40 | -2.2 | 0.0366771003 |
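Model 3 and the final model differ only in swapping `drat` for `wt`, which can be written as below. The model object names `lm.vif` and `lm.final` are my guesses based on the summary names printed in this post, and I assume all regressors are used as the numeric columns of `mtcars`.

```r
# Model 3 (VIF-selected regressors) versus the final model (drat -> wt).
data(mtcars)

lm.vif   <- lm(mpg ~ drat + vs + am + carb, data = mtcars)
lm.final <- lm(mpg ~ wt + vs + am + carb, data = mtcars)

sum.lm.vif   <- summary(lm.vif)
sum.lm.final <- summary(lm.final)

print(round(sum.lm.vif$coefficients, 4))
print(round(sum.lm.final$coefficients, 4))

# Adjusted R-squared of both models side by side.
print(round(c(vif = sum.lm.vif$adj.r.squared,
              final = sum.lm.final$adj.r.squared), 2))
```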

### Diagnostics of residuals

Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage plots are generated to diagnose the residuals of the final model (Fig. 3). The Residuals vs. Fitted and Normal Q-Q plots show that the residuals are approximately normally distributed. The Residuals vs. Leverage plot with Cook's distance shows at least one point that has high leverage and a low residual. Calculation of hat values shows that the Maserati Bora has the highest value (0.43).

Figure 3. Diagnostics of residuals
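The four diagnostic plots and the hat values can be obtained with base R. This is a sketch that refits the final model under the assumption that it is `mpg ~ wt + vs + am + carb` on `mtcars`; the name `fit_final` is illustrative.

```r
# Residual diagnostics for the final model.
data(mtcars)
fit_final <- lm(mpg ~ wt + vs + am + carb, data = mtcars)

# Residuals vs. Fitted, Normal Q-Q, Scale-Location, Residuals vs. Leverage.
op <- par(mfrow = c(2, 2))
plot(fit_final)
par(op)

# Leverage (hat value) of each car, largest first.
hat <- hatvalues(fit_final)
print(round(head(sort(hat, decreasing = TRUE), 3), 2))

# Cook's distance combines leverage and residual size.
print(round(head(sort(cooks.distance(fit_final), decreasing = TRUE), 3), 2))
```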

### Analysis and conclusions

#### Q1. Is an automatic or manual transmission better for MPG?

As described in the preliminary analysis, a simple t-test of MPG by transmission type showed that MT is better than AT; however, in the final model, where weight is included, transmission type no longer appeared to be strongly related to MPG. To understand this more deeply, I did a t-test of weight between automatic and manual transmissions. The difference was statistically significant (p-value < 0.01, 95% confidence interval = (0.85, 1.86)). Assuming the automobile models in the data represent the design aspects of the cars on the market in that period, automatic transmission cars were significantly heavier than manual transmission cars, and the heavier weight of automatic cars has a negative effect on MPG.
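The weight comparison can be sketched the same way as the MPG comparison. I assume the default Welch t-test, with weight in the units of the `wt` column of `mtcars` (1000 lbs); `tt_wt` is an illustrative name.

```r
# Two-sample t-test of car weight by transmission type.
data(mtcars)

# Difference is automatic (am = 0) minus manual (am = 1).
tt_wt <- t.test(wt ~ am, data = mtcars)

print(tt_wt$p.value)
print(round(tt_wt$conf.int, 2))  # 95% confidence interval
print(round(tt_wt$estimate, 2))  # mean weight per group
```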

#### Q2. Quantify the MPG difference between automatic and manual transmissions

In the model created by simple step-wise VIF analysis (Model 3), the p-value for transmission type is less than 0.05, and the coefficient ($$\beta_{am}$$) is 4.95, meaning that manual transmission cars have higher MPG than automatic transmission cars by 4.95 MPG, with the other regressors held fixed. However, this model has a relatively poor fit compared to the final model, and it was rejected. In the final model, transmission type was not strongly related to MPG after adjustment if we take a p-value of 0.05 as the threshold. If we relax the threshold to 0.1, the final model suggests that MT cars may bring about a 3.07 MPG improvement compared to AT cars.
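One way to express the uncertainty around these estimates is a confidence interval for the transmission coefficient in each model. The sketch below assumes the two model formulas given in this post and uses illustrative object names.

```r
# Estimate and 95% confidence interval of the transmission coefficient.
data(mtcars)
fit_vif   <- lm(mpg ~ drat + vs + am + carb, data = mtcars)
fit_final <- lm(mpg ~ wt + vs + am + carb, data = mtcars)

am_effect <- function(fit) {
  ci <- confint(fit, "am", level = 0.95)
  c(estimate = unname(coef(fit)["am"]), lower = ci[1, 1], upper = ci[1, 2])
}

am_vif   <- am_effect(fit_vif)
am_final <- am_effect(fit_final)

print(round(rbind(vif = am_vif, final = am_final), 2))
```

An interval that contains zero corresponds to a p-value above 0.05, which is the case for the final model.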

### References and notes

1. The R Markdown source code that generated this document is available in a GitHub repository.
2. Variance inflation factor.

Original post: April 20, 2015 | Last updated: May 26, 2017