Review of bivariate regression

Recall the model for bivariate least-squares regression.

When we regress \(Y\) and \(X\) we’re looking for a linear function, \(f(X)\), for which the following sum-of-squared deviations is minimized:

\[ \sum_{i=1}^n (y_i - f(x_i))^2 \]

The general form a linear function of one variable is a line,

\[ f(x) = a + bX \]

where \(b\) is the slope of the line and \(a\) is the intercept.

Multiple regression is conceptually similar to bivariate regression

The idea behind multiple regression is almost exactly the same as bivariate regression, except now we try and fit a linear model for \(Y\) using multiple explanatory variables, \(X_1, X_2,\ldots, X_m\). That is we’re looking for a linear function, \(f(X_1, X_2,\ldots,X_m)\) that minimizes:

\[ \sum_{i=1}^n(y_i - f(x_1, x_2,\ldots, x_m))^2 \]

A linear function of more than one variable is written as:

\[ f(X_1, X_2,\ldots,X_m) = a + b_1X_1 + b_2X_2 + \cdots + b_mX_m \]

Where \(a\) is the intercept and \(b_1, b_2,\ldots,b_m\) are the regression coefficients. Geometrically the regression coefficients have the same interpretation as in the bivariate case – slopes with respect to the corresponding variable.

Solving the multiple regression problem

Mathematically, we simultaneously find the best fitting regression coefficients, \(b_1, b_2,\ldots,b_m\), using linear algebra. However, since we haven’t covered linear algebra in this course, I will omit the details.

Interpretting Multiple Regression

Here are some things to keep in mind when interpretting a multple regression:

Synthetic data

As we’ve done previously, let’s generate some synthetic data where we “know” (specify) the underlying relationships among the variables so that we can see how close to the “truth” we get when we fit the linear model. Here, Y is a function of two variables, W and X.

set.seed(20160927)  # seed the RNG

W <- rnorm(50, mean = 1, sd = 2)
X <- rnorm(50, mean = 3, sd = 1)
Y <- 2*W + 5*X + rnorm(50, sd = 2)

3D plots and RColorBrewer color schemes

ggplot has no built in facilities for 3D scatter plots so we’ll use a package called plot3D. Install plot3D via the command line or using the Packages tab in the RStudio GUI.

library(plot3D)
scatter3D(W, X, Y,
          xlab = "W", ylab = "X", zlab = "Y",
  # "pch" is short for "plot character", it sets the symbol type to plot, which 
  # in this case a solid circle. For a list of possible characters and their 
  # codes see the help (?pch)
          pch = 16)

The plot3D package uses the a rainbow color scheme by default. This color scheme should generally be avoided.

Install the RColorBrewer package for easy access to more appropriate color schemes (install.packages("RColorBrewer")). RColorBrewer defines a set of color palettes that have been optimized for color discrimination, to be color blind friendly, etc. Once you’ve installed the RColorBrewer package you can see the available color palettes as so:

library(RColorBrewer)
# show representations of the palettes
par(cex = 0.5) # reduce size of text in the follow plot
display.brewer.all()  

Having loaded the RColorBrewer package we can pass a color palette to the scatter3D function, using brewer.pal (see ?brewer.pal). Here we use a gradient of blues.

scatter3D(W, X, Y,
          xlab = "W", ylab = "X", zlab = "Y",
          pch = 16,
          col = brewer.pal(7, "Blues"),
          theta = 20, phi = 20 # theta and phi setting viewing angle
          )

Of course, if you just want a uniform color, that’s possible too. Simply specify a single color in the col argument. Here we also add vertical lines to the plot using the type argument.

scatter3D(W, X, Y,
          xlab = "W", ylab = "X", zlab = "Y",
          col = "black", alpha = 0.75, pch = 16,
          theta = 20, phi = 20,
          bty = "g",
          type = "h")

For more examples of how you can modify plots generated with the plot3D package see this web page.

Lines, planes, and hyperplanes

A linear function of one variable is a line, a linear function of two variables is a plane, and a linear function of more than two variables is generally called a hyperplane.

Fitting a multiple regression model in R

Using the lm function, fitting multiple regression models is a straightforward extension of fitting a bivariate regression model.

df.WXY <- data.frame(W = W, X = X, Y = Y)
fit.YonWX <- lm(Y ~ W + X, df.WXY)
summary(fit.YonWX)

Call:
lm(formula = Y ~ W + X, data = df.WXY)

Residuals:
   Min     1Q Median     3Q    Max 
-3.966 -1.105 -0.392  0.842  3.869 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.479      0.801     0.6     0.55    
W              2.111      0.129    16.4   <2e-16 ***
X              4.903      0.246    19.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.82 on 47 degrees of freedom
Multiple R-squared:  0.938, Adjusted R-squared:  0.935 
F-statistic:  355 on 2 and 47 DF,  p-value: <2e-16

Visualizing the regression plane

For multiple regression on two predictor variables we can visualize the plane of best fit but adding it as a surface to our 3D plot.

# Create a regular grid over the range of W and X values
grid.lines = 35
W.grid <- seq(min(W), max(W), length.out = grid.lines)
X.grid <- seq(min(X), max(X), length.out = grid.lines)
WX.grid <- expand.grid( x = W.grid, y = X.grid)

df.WX.grid <- data.frame(W = WX.grid[,1], X = WX.grid[,2])

# Predicted Y at each point in grid
Y.grid <- matrix(predict(fit.YonWX, newdata = df.WX.grid), 
                 nrow = grid.lines, ncol = grid.lines)

# Predicted Y at observed 
Y.predicted <- predict(fit.YonWX)

# scatter plot with regression plane
scatter3D(W, X, Y, 
    pch = 16, theta = 20, phi = 20, 
    col = "red", alpha=0.9,
    ticktype = "detailed",
    xlab = "W", ylab = "X", zlab = "Y",  
    surf = list(x = W.grid,
                y = X.grid, 
                z = Y.grid, 
                facets = NA, 
                fit = Y.predicted,
                col = "black", alpha = 0.5)
    )

Interactive 3D Visualizations Using OpenGL

The package rgl is another package that we can use for 3D visualization. rgl is powerful because it lets us create interactive plots we can rotate and zoom in/out on.

On OS X, rgl requires you to install a program called XQuartz. XQuartz can be downloaded from the XQuartz Home Page. If you’re on a Mac, install XQuartz before installing rgl.

Once you’ve installed rgl try the following code:

library(rgl)

# create 3D scatter, using spheres to draw points
plot3d(W, X, Y, 
       type = "s", 
       size = 1.5,
       col = "red")
rglwidget() # only need to include this line if using in a markdown document

We can add a 3d plane to our plot, representing the multiple regression model, with the rgl.planes function as show below.

coefs <- coef(fit.YonWX)
b1 <- coefs["W"]
b2 <- coefs["X"]
c <- -1
a <- coefs["(Intercept)"]
rgl.planes(b1, b2, c, a, alpha = 0.9, color = "gray")
rglwidget()