Logistic regression is used when the dependent variable is discrete (often binary). The explanatory variables may be either continuous or discrete.
Examples:
We model the binary response variable, \(Y\), as a function of the predictor variables, \(X_1\), \(X_2\), etc as :
\[ P(Y = 1|X_1,\ldots,X_p) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]
So we’re modeling the probability of the states as a function of a linear combination of the predictor variables.
For logistic regression, \(f\) is thus the logistic function: \[ f(z) = \frac{e^z}{1+e^z} = \frac{1}{1 + e^{-z}} \]
Therefore, the bivariate logistic regression is given by: \[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
titanic.csv contains information about passengers on the Titanic. Variables in this data set include information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not they survived the sinking of the ship (0 = died, 1 = survived).
library(ggplot2)
library(dplyr)
library(cowplot) # for plot_grid fxn
titanic <- read.csv("https://github.com/pmagwene/Bio723/raw/master/datasets/titanic.csv")
names(titanic)
[1] "row.names" "pclass" "survived" "name" "age"
[6] "embarked" "home.dest" "room" "ticket" "boat"
[11] "sex"
We’ve all heard the phrase, “Women and children first”, so we might expect that the probability that a passenger survived the sinking of the Titanic related to their sex and/or age. Let’s create separate data subsets for male and female passengers.
male <- filter(titanic, sex == "male")
dim(male)
[1] 850 11
female <- filter(titanic, sex == "female")
dim(female)
[1] 463 11
Let’s create visualizations of survival as a function of age for the male and female passengers.
fcolor = "lightcoral"
mcolor = "lightsteelblue"
female.plot <- ggplot(female, aes(x = age, y = survived)) +
geom_jitter(width = 0, height = 0.1, color = fcolor) +
labs(title = "Female Passengers")
male.plot <- ggplot(male, aes(x = age, y = survived)) +
geom_jitter(width = 0, height = 0.1, color = mcolor) +
labs(title = "Male Passengers")
plot_grid(female.plot, male.plot)
The function glm (generalized linear model) can be used to fit the logistic regression model (as well as other models). Setting the argument family = binomial gives us logistic regression.
fit.female <- glm(survived ~ age, family = binomial, female)
summary(fit.female)
Call:
glm(formula = survived ~ age, family = binomial, data = female)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.192 0.484 0.593 0.680 0.849
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8309 0.3663 2.27 0.023 *
age 0.0234 0.0119 1.97 0.049 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 229.88 on 242 degrees of freedom
Residual deviance: 225.81 on 241 degrees of freedom
(220 observations deleted due to missingness)
AIC: 229.8
Number of Fisher Scoring iterations: 4
fit.male <- glm(survived ~ age, family = binomial, male)
summary(fit.male)
Call:
glm(formula = survived ~ age, family = binomial, data = male)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.062 -0.737 -0.628 -0.407 2.239
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.26565 0.28982 -0.92 0.35936
age -0.03593 0.00949 -3.79 0.00015 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 401.15 on 389 degrees of freedom
Residual deviance: 385.37 on 388 degrees of freedom
(460 observations deleted due to missingness)
AIC: 389.4
Number of Fisher Scoring iterations: 4
To visualize the logistic regression fit, we first use the predict function to generate the model predictions about probability of survival as a function of age.
ages <- seq(0, 75, 1) # predict survival for ages 0 to 75
predicted.female <- predict(fit.female,
newdata = data.frame(age = ages),
type = "response")
predicted.male <- predict(fit.male,
newdata = data.frame(age = ages),
type = "response")
Having generated the predicted probabilities of survival we can then draw those predictions using geom_line.
female.logistic.plot <- female.plot +
geom_line(data = data.frame(age = ages, survived = predicted.female),
color = fcolor, size = 1)
male.logistic.plot <- male.plot +
geom_line(data = data.frame(age = ages, survived = predicted.male),
color = mcolor, size = 1)
plot_grid(female.logistic.plot, male.logistic.plot)
Here’s an alternative “quick and dirty” way to generate the plot above using the awesome power of ggplot. The downside of this approach is we don’t generate the detailed information on the model, which is something you’d certainly want to have in any real analysis.
ggplot(titanic, aes(x=age, y=survived, color=sex)) +
geom_jitter(width = 0, height = 0.1) +
geom_smooth(method="glm", method.args = list(family="binomial")) +
facet_grid(. ~ sex) +
scale_color_manual(values = c(fcolor, mcolor))