Logistic regression

Logistic regression is used when the dependent variable is discrete (often binary). The explanatory variables may be either continuous or discrete.

Examples:

Whether a gene is turned off (=0) or on (=1) as a function of levels of various proteins
Whether an individual is healthy (=0) or diseased (=1) as a function of various risk factors.
Whether an individual died (=0) or survived (=1) some selective event as a function of behavior, morphology, etc.

We model the binary response variable, \(Y\), as a function of the predictor variables, \(X_1\), \(X_2\), etc as :

\[ P(Y = 1|X_1,\ldots,X_p) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]

So we’re modeling the probability of the states as a function of a linear combination of the predictor variables.

For logistic regression, \(f\) is thus the logistic function: \[ f(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}} \]

Therefore, the bivariate logistic regression is given by: \[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Titanic data set

titanic.csv contains information about passengers on the Titanic. Variables in this data set include information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not they survived the sinking of the ship (0 = died, 1 = survived).

library(ggplot2)
library(dplyr)
library(cowplot)  # for plot_grid fxn

titanic <- read.csv("https://github.com/pmagwene/Bio723/raw/master/datasets/titanic.csv")
names(titanic)

 [1] "row.names" "pclass"    "survived"  "name"      "age"      
 [6] "embarked"  "home.dest" "room"      "ticket"    "boat"     
[11] "sex"

Subsetting the data

We’ve all heard the phrase, “Women and children first”, so we might expect that the probability that a passenger survived the sinking of the Titanic related to their sex and/or age. Let’s create separate data subsets for male and female passengers.

male <- filter(titanic, sex == "male")
dim(male)

[1] 850  11

female <- filter(titanic, sex == "female")
dim(female)

[1] 463  11

Visualizing survival as a function of age

Let’s create visualizations of survival as a function of age for the male and female passengers.

fcolor = "lightcoral"
mcolor = "lightsteelblue"

female.plot <- ggplot(female, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.1, color = fcolor) +
  labs(title = "Female Passengers")

male.plot <- ggplot(male, aes(x = age, y = survived)) + 
  geom_jitter(width = 0, height = 0.1, color = mcolor) + 
  labs(title = "Male Passengers")

plot_grid(female.plot, male.plot)

Fitting the logistic regression model

The function glm (generalized linear model) can be used to fit the logistic regression model (as well as other models). Setting the argument family = binomial gives us logistic regression.

fit.female <- glm(survived ~ age, family = binomial, female)
summary(fit.female)


Call:
glm(formula = survived ~ age, family = binomial, data = female)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.192   0.484   0.593   0.680   0.849  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.8309     0.3663    2.27    0.023 *
age           0.0234     0.0119    1.97    0.049 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 229.88  on 242  degrees of freedom
Residual deviance: 225.81  on 241  degrees of freedom
  (220 observations deleted due to missingness)
AIC: 229.8

Number of Fisher Scoring iterations: 4

fit.male <- glm(survived ~ age, family = binomial, male)
summary(fit.male)


Call:
glm(formula = survived ~ age, family = binomial, data = male)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.062  -0.737  -0.628  -0.407   2.239  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.26565    0.28982   -0.92  0.35936    
age         -0.03593    0.00949   -3.79  0.00015 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 401.15  on 389  degrees of freedom
Residual deviance: 385.37  on 388  degrees of freedom
  (460 observations deleted due to missingness)
AIC: 389.4

Number of Fisher Scoring iterations: 4

Visualizing the logistic regression

To visualize the logistic regression fit, we first use the predict function to generate the model predictions about probability of survival as a function of age.

ages <- seq(0, 75, 1) # predict survival for ages 0 to 75

predicted.female <- predict(fit.female, 
                            newdata = data.frame(age = ages),
                            type = "response")

predicted.male <- predict(fit.male,
                          newdata = data.frame(age = ages),
                          type = "response")

Having generated the predicted probabilities of survival we can then draw those predictions using geom_line.

female.logistic.plot <- female.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.female),
            color = fcolor, size = 1)

male.logistic.plot <- male.plot + 
  geom_line(data = data.frame(age = ages, survived = predicted.male),
            color = mcolor, size = 1)

plot_grid(female.logistic.plot, male.logistic.plot)

Quick and dirty visualization

Here’s an alternative “quick and dirty” way to generate the plot above using the awesome power of ggplot. The downside of this approach is we don’t generate the detailed information on the model, which is something you’d certainly want to have in any real analysis.

ggplot(titanic, aes(x=age, y=survived, color=sex)) + 
  geom_jitter(width = 0, height = 0.1) +
  geom_smooth(method="glm", method.args = list(family="binomial"))  + 
  facet_grid(. ~ sex) +
  scale_color_manual(values = c(fcolor, mcolor))

Introduction to Logistic Regression

Paul M. Magwene

September 30, 2016