Recall the model for bivariate least-squares regression.
When we regress \(Y\) and \(X\) we’re looking for a linear function, \(f(X)\), for which the following sum-of-squared deviations is minimized:
\[ \sum_{i=1}^n (y_i - f(x_i))^2 \]
The general form a linear function of one variable is a line,
\[ f(x) = a + bX \]
where \(b\) is the slope of the line and \(a\) is the intercept.
The idea behind multiple regression is almost exactly the same as bivariate regression, except now we try and fit a linear model for \(Y\) using multiple explanatory variables, \(X_1, X_2,\ldots, X_m\). That is we’re looking for a linear function, \(f(X_1, X_2,\ldots,X_m)\) that minimizes:
\[ \sum_{i=1}^n(y_i - f(x_1, x_2,\ldots, x_m))^2 \]
A linear function of more than one variable is written as:
\[ f(X_1, X_2,\ldots,X_m) = a + b_1X_1 + b_2X_2 + \cdots + b_mX_m \]
Where \(a\) is the intercept and \(b_1, b_2,\ldots,b_m\) are the regression coefficients. Geometrically the regression coefficients have the same interpretation as in the bivariate case – slopes with respect to the corresponding variable.
Mathematically, we simultaneously find the best fitting regression coefficients, \(b_1, b_2,\ldots,b_m\), using linear algebra. However, since we haven’t covered linear algebra in this course, I will omit the details.
Here are some things to keep in mind when interpretting a multple regression:
Comparing the size of regression coefficients only makes sense if all the predictor (explanatory) variables have the same scale
If the explanatory variables (\(X_1, X_2,\ldots,X_m\)) are highly correlated, then the regression solution is “unstable” – a small change in the data could lead to a very large change in the regression model.
As we’ve done previously, let’s generate some synthetic data where we “know” (specify) the underlying relationships among the variables so that we can see how close to the “truth” we get when we fit the linear model. Here, Y is a function of two variables, W and X.
set.seed(20160927) # seed the RNG
W <- rnorm(50, mean = 1, sd = 2)
X <- rnorm(50, mean = 3, sd = 1)
Y <- 2*W + 5*X + rnorm(50, sd = 2)
ggplot has no built in facilities for 3D scatter plots so we’ll use a package called plot3D
. Install plot3D
via the command line or using the Packages tab in the RStudio GUI.
library(plot3D)
scatter3D(W, X, Y,
xlab = "W", ylab = "X", zlab = "Y",
# "pch" is short for "plot character", it sets the symbol type to plot, which
# in this case a solid circle. For a list of possible characters and their
# codes see the help (?pch)
pch = 16)
The plot3D package uses the a rainbow color scheme by default. This color scheme should generally be avoided.
Install the RColorBrewer
package for easy access to more appropriate color schemes (install.packages("RColorBrewer")
). RColorBrewer defines a set of color palettes that have been optimized for color discrimination, to be color blind friendly, etc. Once you’ve installed the RColorBrewer package you can see the available color palettes as so:
library(RColorBrewer)
# show representations of the palettes
par(cex = 0.5) # reduce size of text in the follow plot
display.brewer.all()
Having loaded the RColorBrewer package we can pass a color palette to the scatter3D
function, using brewer.pal
(see ?brewer.pal
). Here we use a gradient of blues.
scatter3D(W, X, Y,
xlab = "W", ylab = "X", zlab = "Y",
pch = 16,
col = brewer.pal(7, "Blues"),
theta = 20, phi = 20 # theta and phi setting viewing angle
)
Of course, if you just want a uniform color, that’s possible too. Simply specify a single color in the col
argument. Here we also add vertical lines to the plot using the type
argument.
scatter3D(W, X, Y,
xlab = "W", ylab = "X", zlab = "Y",
col = "black", alpha = 0.75, pch = 16,
theta = 20, phi = 20,
bty = "g",
type = "h")
For more examples of how you can modify plots generated with the plot3D
package see this web page.
A linear function of one variable is a line, a linear function of two variables is a plane, and a linear function of more than two variables is generally called a hyperplane.
Using the lm
function, fitting multiple regression models is a straightforward extension of fitting a bivariate regression model.
df.WXY <- data.frame(W = W, X = X, Y = Y)
fit.YonWX <- lm(Y ~ W + X, df.WXY)
summary(fit.YonWX)
Call:
lm(formula = Y ~ W + X, data = df.WXY)
Residuals:
Min 1Q Median 3Q Max
-3.966 -1.105 -0.392 0.842 3.869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.479 0.801 0.6 0.55
W 2.111 0.129 16.4 <2e-16 ***
X 4.903 0.246 19.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.82 on 47 degrees of freedom
Multiple R-squared: 0.938, Adjusted R-squared: 0.935
F-statistic: 355 on 2 and 47 DF, p-value: <2e-16
For multiple regression on two predictor variables we can visualize the plane of best fit but adding it as a surface to our 3D plot.
# Create a regular grid over the range of W and X values
grid.lines = 35
W.grid <- seq(min(W), max(W), length.out = grid.lines)
X.grid <- seq(min(X), max(X), length.out = grid.lines)
WX.grid <- expand.grid( x = W.grid, y = X.grid)
df.WX.grid <- data.frame(W = WX.grid[,1], X = WX.grid[,2])
# Predicted Y at each point in grid
Y.grid <- matrix(predict(fit.YonWX, newdata = df.WX.grid),
nrow = grid.lines, ncol = grid.lines)
# Predicted Y at observed
Y.predicted <- predict(fit.YonWX)
# scatter plot with regression plane
scatter3D(W, X, Y,
pch = 16, theta = 20, phi = 20,
col = "red", alpha=0.9,
ticktype = "detailed",
xlab = "W", ylab = "X", zlab = "Y",
surf = list(x = W.grid,
y = X.grid,
z = Y.grid,
facets = NA,
fit = Y.predicted,
col = "black", alpha = 0.5)
)
The package rgl
is another package that we can use for 3D visualization. rgl
is powerful because it lets us create interactive plots we can rotate and zoom in/out on.
On OS X, rgl requires you to install a program called XQuartz. XQuartz can be downloaded from the XQuartz Home Page. If you’re on a Mac, install XQuartz before installing rgl
.
Once you’ve installed rgl
try the following code:
library(rgl)
# create 3D scatter, using spheres to draw points
plot3d(W, X, Y,
type = "s",
size = 1.5,
col = "red")
rglwidget() # only need to include this line if using in a markdown document
We can add a 3d plane to our plot, representing the multiple regression model, with the rgl.planes
function as show below.
coefs <- coef(fit.YonWX)
b1 <- coefs["W"]
b2 <- coefs["X"]
c <- -1
a <- coefs["(Intercept)"]
rgl.planes(b1, b2, c, a, alpha = 0.9, color = "gray")
rglwidget()