Association Between Categorical Variables

A standard approach for testing the independence/dependence of a pair of categorical variables is to use a \(\chi^2\) (Chi-squared) test of independence.

The null and alternative hypotheses for the \(\chi^2\) test are as follows:

Contingency tables of observed counts

We typically depict the relationship between categorical variables using a “contingency table”.

B1 B2 Total
A1 \(O_{11}\) \(O_{12}\) \(O_{11}\)+\(O_{12}\)
A2 \(O_{21}\) \(O_{22}\) \(O_{12}\)+\(O_{22}\)
Total \(O_{11}\)+\(O_{21}\) \(O_{12}\)+\(O_{22}\) \(O_{11}\)+\(O_{12}\)+\(O_{12}\)+\(O_{22}\)

The rows and columns indicate the different categories for variables A and B respectively, and the cells, \(O_{ij}\), give the counts of the number of observations for the corresponding combination of A and B. For example, the cell \(O_{11}\) gives the number of observations that that belong to both the category A1 and B1, while \(O_{12}\) gives the number that are both A1 and B2, etc.

Example data set: Titanic passengers

Let’s revisit a data set we used when we discussed logistic regression that contains information about passengers on the Titanic.

titanic.csv includes information such as sex, age, passenger class (1st, 2nd, 3rd), and whether or not a passenger survived the sinking of the ship (0 = died, 1 = survived). We’ll use the this data set to illustrate the key steps of conducting a \(\chi^2\) test.

Constructing a contingency table in R

The R function table provides a simple way to construct contingency tables. Here’s how we can use it:

titanic <- read.csv("https://github.com/pmagwene/Bio723/raw/master/datasets/titanic.csv")

table(titanic$sex, titanic$survived)
        
           0   1
  female 156 307
  male   708 142

table doesn’t produce the marginal counts. However we can easily add those by including a call to addmargins:

addmargins(table(titanic$sex, titanic$survived))
        
            0    1  Sum
  female  156  307  463
  male    708  142  850
  Sum     864  449 1313

xtabs provides an alternate way to create a contingency table using a formula based interface.

sex.by.survived <- xtabs(~ sex + survived, titanic)
sex.by.survived
        survived
sex        0   1
  female 156 307
  male   708 142

Mosaic plots

A mosic plot is a useful way to visualize a contigency table.

mosaicplot(sex.by.survived,
           color=c("steelblue","firebrick"),
           main="Sex x Survival, Titanic Passengers")

Expected counts

If the two categorical variables were independent than we would expect that the count in each cell of the contigency table to equal the product of the marginal probabilities times the total number of observations.

Here are the marginal probabilities for sex and survival in the Titanic data set:

# marginal probabilities for rows of table (sex)
margin.table(sex.by.survived, margin = 1)/sum(sex.by.survived)
sex
female   male 
0.3526 0.6474 
# marginal probabilities for cols of table (survived)
margin.table(sex.by.survived, margin = 2)/sum(sex.by.survived)
survived
    0     1 
0.658 0.342 

Calculating the expected counts

We can calculate the expected counts based the marginal probabilities as follows:

observed.counts <- sex.by.survived

# get sum over rows and columns
total.ct <- sum(observed.counts)
total.ct
[1] 1313
# get sums and probabilities over rows
row.sums <- margin.table(observed.counts, margin = 1)
row.probs <- row.sums/total.ct
row.sums
sex
female   male 
   463    850 
# get sum over columns
col.sums <- margin.table(observed.counts, margin = 2)
col.probs <- col.sums/total.ct
col.sums
survived
  0   1 
864 449 
# calculate expected probabilities
# the `outer` function gives the outer product -- all the pairwise multiples
# of the two vectors passed to it
expected.probs <- outer(row.probs, col.probs)
expected.probs
        survived
sex          0      1
  female 0.232 0.1206
  male   0.426 0.2214
# and now calculate the expected counts
expected.counts <- expected.probs * total.ct
expected.counts
        survived
sex          0     1
  female 304.7 158.3
  male   559.3 290.7

As a reminder, here are the observed counts. Compare this to the expected counts calculated above:

observed.counts
        survived
sex        0   1
  female 156 307
  male   708 142

The \(\chi^2\)-statistic

To summarize the overall divergence between observed and expected values we’ll calculate a new test statistic, we’ll call \(\chi^2\) (chi-squared), which is defined as:

\[\begin{align} \chi^2 &= \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \cdots + \frac{(O_{mn} - E_{mn})^2}{E_{mn}} \\ \\ &= \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{align}\]

where \(m\) and \(n\) are the number of categories of the two variables under consideration.

We can think of the \(\chi^2\)-statistic as the sum of the normalized squared differences between observed and expected counts (where each cell is normalized by the expected counts).

Here’s the calculation of the \(\chi^2\)-statistic for the Titanic example:

titanic.chisq <- sum((observed.counts - expected.counts)**2/expected.counts)
titanic.chisq
[1] 327.7

The \(\chi^2\)-distribution

Under the null hypothesis that the categorical variables are independent the \(\chi^2\)-statistic has a sampling distribution called the \(\chi^2\)-distribution. The \(\chi^2\)-distribution takes a single parameter, degree of freedom (df), which for the \(\chi^2\) test of independence is equal to \((m-1)(n-1)\) where \(m\) and \(n\) are the number of categories in the two variables.

For the Titanic example, \(m = 2\) and \(n = 2\), therefore the degrees of freedom, \(df = 1\). The \(\chi^2_{df=1}\) distribution looks like this:

library(ggplot2)
x <- seq(0, 8, length.out = 300)
d <- dchisq(x, df = 1)
qplot(x = x, y = d, geom="line",  color=I("steelblue"),
      xlab = "Chi2", ylab="Density")

Here is a figuring demonstrating the \(\chi^2\) distributions for different degrees of freedom:

ggplot(data.frame(x = c(0,8)), aes(x)) + 
  stat_function(fun=dchisq, args=list(df = 1), aes(color="df=1")) + 
  stat_function(fun=dchisq, args=list(df = 2), aes(color="df=2")) + 
  stat_function(fun=dchisq, args=list(df = 3), aes(color="df=3")) + 
  stat_function(fun=dchisq, args=list(df = 5), aes(color="df=5")) +
  scale_colour_manual("df", values = c("steelblue", "firebrick", "forestgreen", "goldenrod"))

Calculating a p-value

We can use pchisq to calculate the probability of observing a \(\chi^2\) value at least as extreme as the one we observed under the null hypothesis of no association among the categorical variables:

pchisq(titanic.chisq, df = 1, lower.tail = FALSE)
[1] 3.039e-73

This is very strong evidence upon which to reject the null hypothesis!

Simple \(\chi^2\) tests in R

\(\chi^2\) tests are a very common type of analysis. Therefore R has a built-in chisq.test function to make the calculations easy. Here’s how to use it:

chi2.titanic <- chisq.test(titanic$sex, titanic$survived, correct = FALSE)

# summary overview, note rounding
chi2.titanic

    Pearson's Chi-squared test

data:  titanic$sex and titanic$survived
X-squared = 330, df = 1, p-value <2e-16
# observed counts
chi2.titanic$observed
           titanic$survived
titanic$sex   0   1
     female 156 307
     male   708 142
# expected counts
chi2.titanic$expected
           titanic$survived
titanic$sex     0     1
     female 304.7 158.3
     male   559.3 290.7
# chi2 statistic
chi2.titanic$statistic
X-squared 
    327.7 
# corresponding p-value
chi2.titanic$p.value
[1] 3.039e-73

Note that I set correct = FALSE. This stops R from applying what is called “Yate’s Continuity Correction”, which is only appropriate when one or more of the counts are very small (<10).

Note too that the summary output from chisq.test rounds the \(\chi^2\) statistic and the corresponding p-value, so if you want the exact values you need to get them explicitly.

Your turn: Bumpus’ bird data

Bumpus (1898) described a sample of house sparrows which he collected after a very severe storm. The sample included 136 birds, sixty four of which perished during the storm. Also included in his description were a variety of morphological measurements on the birds and information about their sex and age (for male birds). This data set has become a benchmark in the evolutionary biology literature for demonstrating methods for analyzing natural selection. The bumpus data set is available from the class website as tab-delimited file called bumpus-data.txt .

Bumpus, H. C. 1898. The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. (A fourth contribution to the study of variation.) Biol. Lectures: Woods Hole Marine Biological Laboratory, 209–225.

Assignment (5 pts)

The variables “sex” and “survived” contain information on the sex of each bird, and whether they survived the storm. Carry out a \(\chi^2\)-test to test the hypothesis that there is a relationship between sex and survival. What is your conclusion based on this test? Show your work.