There are a number of packages (e.g. coin) that include functions for carrying out randomization/permutation tests in R. However, it’s often just as easy to write a quick set of functions to carry out such tests yourself. We’ll illustrate a simple example of this using the ``jackal" example from Manly (2006).
Consider the following data set composed of measures of mandible lengths (in mm) for male and female golden jackals. This set of measurements was taken from a set of skeletons in the collections of the British Museum of Natural History.
Let’s first create two vectors to represent this set of measurements
males <- c(120,107,110,116,114,111,113,117,114,112)
females <- c(110,111,107,108,110,105,107,106,111,111)
jackals <- data.frame(sex = c(rep("m",10),rep("f",10)), mandible.length = c(males,females))
And we’ll create a simple strip chart to visualize the data
library(ggplot2)
ggplot(jackals, aes(x = sex, y = mandible.length, color = sex)) +
geom_jitter(width=0.2, height=0,size=2,alpha=0.5)
mean(males)
[1] 113.4
mean(females)
[1] 108.6
mean(males) - mean(females)
[1] 4.8
The hypothesis we want to test is that male jackals have, on average, larger mandibles than female jackals. The strip plot we constructed and difference in the means would seem to suggest so but let’s carry out some more formal tests. The obvious way to compare this set of measurements would be to carry out a t-test, which is approropriate if the samples are normally distributed with approximately equal variance. We have small samples here, so it’s hard to know if the normal distribution holds. Instead we’ll use a randomization test to compare the observed difference in the means (4.8) to the distribution of differences we would expect to observe if the labels male' andfemale’ were randomly applied to samples of equal size from the data at hand.
Let’s create a function that takes a sample and randomly assigns the observations to two groups of a specified size. The function takes as input a vector of values (size \(N\)) and two integers representing the sample sizes (\(n_1\) and \(n_2\) where \(n_1 + n_2 = N\)) of the two groups to be compared.
two.group <- function(x,n1,n2){
# sample w/out replacement
reordered <- sample(x, length(x)) # see help(sample) for more info
g1 <- reordered[seq(1,n1)]
g2 <- reordered[seq(n1+1,n1+n2)]
list(g1,g2)
}
Test out this function by calling it repeatedly as shown below. You’ll see that it returns a random reordering of the original data, split into two groups:
jackals <- c(males,females)
two.group(jackals,10,10)
[[1]]
[1] 110 114 110 111 117 107 105 113 112 114
[[2]]
[1] 111 111 120 107 116 107 110 111 106 108
two.group(jackals,10,10) # call it again to get a different sample
[[1]]
[1] 107 111 110 114 111 117 107 111 116 110
[[2]]
[1] 120 106 105 110 112 107 108 113 114 111
Now let’s write a simple function that returns the mean difference between two samples:
mean.diff <- function(x1,x2) {
mean(x1) - mean(x2)
}
Now let’s write a generic randomization function:
set.seed(20161204)
randomization <- function(x1,x2,fxn,nsamples=100){
stats <- c()
orig <- c(x1,x2)
for (i in 1:nsamples){
g <- two.group(orig, length(x1), length(x2))
stats <- c(stats, fxn(g[[1]],g[[2]]))
}
return (stats)
}
We can then use the |randomization| function we wrote as follows to evaluate the signficance of the observed difference in means in the original sample:
# generate 1000 samples of the mean.diff for randomized data
rsample <- randomization(males,females,mean.diff,1000)
# examine the distribution
quickplot(rsample, bins=20,
main="Histogram of randomized differences\nin male and female means",
xlab = "mean(males) - mean(females)")
# in how many of the random samples is mean difference between the two groups
# as great or larger than the observed difference in our original samples?
# you might get a slightly different answer
ngreater <- sum(rsample >= 4.8)
ngreater
[1] 3
So our conclusion is that the probability of getting a mean difference between samples of this size is about \(3/1000=0.003\). Note that we can’t generalize this to golden jackals as a whole because we know nothing about whether these samples actually represent random samples of the golden jackal population or biases that might have been imposed on the collection (e.g. maybe the collectors liked to single out particularly large males). However, if we saw a similar trend (males larger than females) in multiple museum collections we might see this as supporting evidence that the trend held true in general.
Note that we wrote our randomization function to take an arbitrary function that takes as it’s input two vectors of data. That means we can use it to estimate the randomized distribution of arbitrary statistics of interest. Here we illustrate that with a function that calculates the ratio of variances. This for example could be used to assess the null hypothesis that male and female jackals have similar variation in mandible length.
ratio.var <- function(x1,x2){
var(x1)/var(x2)
}
ratio.var(males,females) # ratio of variances for the original samples
[1] 2.681
vsample <- randomization(males,females,ratio.var, 1000)
quickplot(vsample, bins=20,
main="Histogram of randomized ratios\nof male and female variances",
xlab = "var(males)/var(females)")
mean(vsample)
[1] 1.277
sum(vsample >= 2.68)
[1] 75
sum(vsample >= 2.68)/1000.
[1] 0.075
In this case the observed ratio of variances isn’t particularly unusual. Let’s make one more comparison. We know (or at least we should know!) that ratios of variances have an \(F\)-distribution so let’s compare the distribution of ratios of variances from our randomized sample to that of a sample of the same size drawn from the \(F\)-distribution with the same degrees of freedom.
randomF <- rf(1000, 9, 9) # see help(rf)
ggplot() +
geom_density(aes(vsample, color='randomization')) +
geom_density(aes(randomF, color='theoretical')) +
labs(x = "Ratio of Variances") +
scale_color_manual("Type", values = c("steelblue", "firebrick"))
Jackknife estimates of simple statistics are also relatively straightforward to calculate in R. Here’s an example of a simple jackknife function:
jknife <- function(x, fxn, ci=0.95) {
theta <- fxn(x)
n <- length(x)
partials <- rep(0,n)
for (i in 1:n){
partials[i] <- fxn(x[-i])
}
pseudos <- (n*theta) - (n-1)*partials
jack.est <- mean(pseudos)
jack.se <- sqrt(var(pseudos)/n)
alpha = 1-ci
CI <- qt(alpha/2,n-1,lower.tail=FALSE)*jack.se
jack.ci <- c(jack.est - CI, jack.est + CI)
list(est=jack.est, se=jack.se, ci=jack.ci)
}
The bootstrap package (install if necessary) contains a very similar implementation of a jackknife function (jackknife()).
Let’s illustrate our jackknife function using samples drawn from a Poisson distribution. The Poisson is a discrete probability distribution that is often used to describe the probability of a number of events occuring in a fixed period of time, where the events are independent and occur with an average rate \(\lambda\). The Poisson distribution is used to model processes like mutations in DNA sequences or atomic decay. Both the mean and variance of a Poisson distribution are equal to \(\lambda\). Let’s see how well the jackknife does at estimating confidence intervals for the mean and variance of a modest number of samples drawn from a Poisson.
psample <- rpois(25,4) # 25 obsevations from poisson with lambda = 4
psample # your sample will be different
[1] 1 4 2 5 4 5 2 7 2 5 3 8 4 3 3 3 4 2 4 2 1 3 2 3 3
mean(psample)
[1] 3.4
var(psample)
[1] 2.833
jknife(psample, mean)$ci
[1] 2.705 4.095
jknife(psample, var)$ci
[1] 0.7038 4.9628
In both cases above, the true mean and variance were contained within the 95% confidence intervals estimated by the jackknife. Let’s do a little experiment to see how often that’s true for samples of this size:
# create 500 samples of size 25 drawn from Poisson w/lambda=4
psamples <- matrix(rpois(25*500,4),ncol=25,byrow=T)
dim(psamples)
[1] 500 25
# create a convenience function
get.ci <- function(x) { return(x$ci) } #x$ci gives confidence interval
# generate jackknife estimates for mean
j.mean <- apply(psamples, 1, jknife, mean)
# make matrix that holds 95% confidence intervals of mean
mean.ci <- t(sapply(j.mean, get.ci))
mean.ci[1,]
[1] 3.218 5.022
mean.ci[2,]
[1] 2.569 4.071
# check how often true mean is w/in CI
includes.true.mean <- sum(mean.ci[,1] <=4 & mean.ci[,2] >= 4)
includes.true.mean
[1] 462
includes.true.mean/500 # true mean is w/in estimated 95% CI about 93% of the time.
[1] 0.924
# now the same for variances
j.var <- apply(psamples, 1, jknife, var)
var.ci <- t(sapply(j.var, get.ci))
includes.true.var <- sum(var.ci[,1] <=4 & var.ci[,2] >= 4)
includes.true.var
[1] 438
includes.true.var/500 # true variance is w/in 95% CI only 88% of time
[1] 0.876
In the case of the confidence intervals for the mean, the jacknife estimator did a decent job – the true mean is with the 95% confidence interval about 93% of the time. In the case of the variance it did less well. The jackknife confidence intervals work well when the estimator is normally distributed. This suggests that one way we might improve the jackknife CIs is by using a normalizing transformation, like the logarithm function:
log.var <- function(x){log(var(x))}
j.log.var <- apply(psamples, 1, jknife, log.var)
log.var.ci <- t(sapply(j.log.var, get.ci))
includes.true.var.transformed <- sum(log.var.ci[,1] <=log(4) & log.var.ci[,2] >= log(4))
# an improvement in the performance of the 95% CIs
includes.true.var.transformed/500
[1] 0.914
This illustrates the type of simulation study you might do to check the robustness of the jackknife for a statistic of interest for a given class of distributions.
There are several packages that provide functions for doing bootstrapping in R. These include bootstrap and boot. We’ll take a quick look at the functions in bootstrap. Install bootstrap using the standard package installation mechanism.
We’ll start with the same set of samples from the Poisson that we used before to illustrate the jackknife.
library(bootstrap)
# generate 1000 bootstrap sample estiamte of var
b <- bootstrap(psample, 1000, var)
# standard bootstrap confidence limits
# based on assumption of normality
bstar <- b$thetastar
ci.multiplier = abs(qt(0.025, df=24))
c(mean(bstar)-ci.multiplier*sd(bstar), mean(bstar)+ci.multiplier*sd(bstar))
[1] 0.7644 4.5555
# estimate the bootstrap percentile confidence limits
quantile(b$thetastar,c(0.025,0.975))
2.5% 97.5%
1.077 4.474
# for comparison remind ourself of what the jackknife CI was
jknife(psample,var)$ci
[1] 0.7038 4.9628
Now let’s use the bootstrap to look at the distribution of a more complicated pair of statistics – the intercept and coefficients of a logistic regression. This time we’ll use the boot package to do the analysis, which is more flexible than the bootstrap package. Install the boot package as necessary.
To illustrate this let’s first carry out a logistic regression predicting survival as a function of body weight for the Bumpus bird data set we looked at previously.
bumpus <- read.delim("https://raw.githubusercontent.com/Bio204-class/bio204-datasets/master/bumpus-data.txt")
names(bumpus)
[1] "line.number" "sex"
[3] "age..a...adult..y...young." "survived"
[5] "total.length..mm." "alar.extent..mm."
[7] "weight..g." "length.of.beak.and.head"
[9] "length.of.humerus..in." "length.of.femur..in."
[11] "length.of.tibiotarsus..in." "width.of.skull..in."
[13] "length.of.sternal.keel..in."
First off, let’s fix the variable names a little bit. You’ll see that when R imported this data file it replaced spaces, commas, and parentheses in column names with periods. Let’s use what’s called a “regular expression” to match cases where there are more than two successive periods in the variable name and replace that with a single period.
# see the gsub documentation for more details
names(bumpus) <- gsub("[.]{2,}", ".", names(bumpus))
names(bumpus)
[1] "line.number" "sex"
[3] "age.a.adult.y.young." "survived"
[5] "total.length.mm." "alar.extent.mm."
[7] "weight.g." "length.of.beak.and.head"
[9] "length.of.humerus.in." "length.of.femur.in."
[11] "length.of.tibiotarsus.in." "width.of.skull.in."
[13] "length.of.sternal.keel.in."
Now that the variable names are more to our liking, let’s fit the logistic regression model (see Introduction to Logistic Regression for more details) and visualize the regression using ggplot.
# convert TRUE/FALSE values to 1/0 values, to satisfy geom_smooth
bumpus$survived <- as.integer(bumpus$survived)
# fit the model
fit.survival <- glm(survived ~ weight.g., family = binomial, bumpus)
# draw the regression plot
ggplot(bumpus, aes(x=weight.g., y=survived)) +
geom_jitter(width = 0, height = 0.1) +
geom_smooth(method="glm", method.args = list(family="binomial"), se=FALSE) +
labs(x = "Weight (g)", y = "Prob. Survival")
The logistic regression suggests that larger birds were somewhat less likely to survive the storm. Let’s look at a summary of the logistic regression model:
summary(fit.survival)
Call:
glm(formula = survived ~ weight.g., family = binomial, data = bumpus)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.633 -1.165 0.858 1.079 1.463
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.046 3.252 2.47 0.013 *
weight.g. -0.311 0.127 -2.44 0.015 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 188.07 on 135 degrees of freedom
Residual deviance: 181.58 on 134 degrees of freedom
AIC: 185.6
Number of Fisher Scoring iterations: 4
We’re most interested in the estimated intercept and regression coefficients, which we can get as follows:
fit.survival$coefficients
(Intercept) weight.g.
8.0456 -0.3105
Recall that the bivariate logistic regression is defined by the model:
\[ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
where \(\beta_0\) is the intercept and \(\beta_1\) is the regression coefficient with respect to the explanatory variable.
bootThe boot function defined in the boot package requires at least three arguments: boot(data, statistic, R,...). 1) data is the data frame, matrix or vector you want to resample from; 2) statistic is a function which when applied to data returns a vector containing the statistic of interest; and 3) R is the number of bootstrap replicates to generate.
The function passed as the statistic argument to boot must take at least two arguments – the first is the original data, and the second is a vector of indices defining the observations in the bootstrap sample. Here’s a suitable function that will calculate the logistic regression coefficients that we want.
library(boot)
logistic.reg.coeffs <- function(x, indices) {
fit.model <- glm(survived ~ weight.g., family = binomial, x[indices,])
reg.b0 <- fit.model$coefficients[[1]] # intercept
reg.b1 <- fit.model$coefficients[[2]] #
return(c(reg.b0, reg.b1))
}
Having defined this function, we can carry out the bootstrap as follows:
# generate 500 bootstrap replicates
nreps <- 500
reg.boot <- boot(bumpus, logistic.reg.coeffs, nreps)
The object returned by the boot function has various components. First, let’s look at the bootstrap replicates of our statistic(s) of interest, which are stored in a matrix called t.
# the first column of t corresponds to the intercept
# the second column to the coefficient w/respect to the explanatory
# variable.
dim(reg.boot$t)
[1] 500 2
We can look at histograms of the bootstrap estimates of the intercept and regression coefficient:
quickplot(reg.boot$t[,1], bins = 25, xlab="Regression Intercept")
quickplot(reg.boot$t[,2], bins = 25, xlab="Regression Coefficient")
We can visualize the 95% confidence intervals based on the set of bootstrapped logistic regressions as follows:
predicted.y <- function(x, coeffs){
1.0/(1.0 + exp(-(coeffs[1] + coeffs[2]*x)))
}
# setup x-values over which to make predictions
nx <- 200
x <- seq(22, 32, length.out = nx)
# create empty matrix to hold bootstrap predicdtions
predicted.mtx <- matrix(nrow=nreps, ncol = nx)
for (i in 1:nreps) {
predicted.mtx[i,] <- predicted.y(x, reg.boot$t[i,])
}
sample.prediction <- predict(fit.survival,
data.frame(weight.g. = x),
type = "response")
quantile.975 <- function(x){ quantile(x, 0.975) }
quantile.025 <- function(x){ quantile(x, 0.025) }
ggplot() + xlim(22, 32) + ylim(0,1) +
geom_jitter(aes(x=bumpus$weight.g., y=bumpus$survived),
width = 0, height = 0.1) +
geom_line(aes(x = x, y = sample.prediction), color='red') +
geom_line(aes(x = x, y = apply(predicted.mtx, 2, quantile.975))) +
geom_line(aes(x = x, y = apply(predicted.mtx, 2, quantile.025))) +
labs(x = "Weight (g)", y = "Prob. Survival",
title="Bumpus Survival Data\nLogistic Regression and Bootstrap 95% CI")
We can compare out bootstrap 95% confidence band to the confidence intervals calculated by geom_smooth, which uses an asymptotic approximation.
# draw the regression plot
ggplot(bumpus, aes(x=weight.g., y=survived)) +
geom_jitter(width = 0, height = 0.1) +
geom_smooth(method="glm", method.args = list(family="binomial"), se=TRUE) +
labs(x = "Weight (g)", y = "Prob. Survival")