The possums data set is a tab-delimited file containing morphological measurements on 104 brushtail possums (Trichosurus vulpecula). The variables include sex, age, head length, skull width, total body length, and tail length.
Figure: Brushtail possums (source: Wikipedia)
Previous studies of brushtail possums in Australia have established that the mean tail length of adult possums is 37.86 cm. I am studying an isolated population of possums in the state of Victoria, and I am interested in whether the mean tail length of Victorian possums is shorter than that of possums in the rest of Australia.
We begin the process of statistical inference by formalizing our question of interest in terms of a “null hypothesis” and an “alternative hypothesis”.
Since I have an a priori reason to expect that Victorian possums have shorter tails, this is a “one-tailed” hypothesis test (pun intended!).
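Concretely, the hypotheses for this test can be written as:

\[
H_0: \mu = 37.86 \text{ cm} \qquad H_A: \mu < 37.86 \text{ cm}
\]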
library(ggplot2)
library(dplyr)
library(cowplot)
# read the possum data from the course repository
possums <- read.delim("https://github.com/Bio204-class/bio204-datasets/raw/master/possum.txt")
#possums <- read.delim("/Users/pmagwene/gits/github/bio204-datasets/possum.txt")
# examine the data set
dim(possums)
[1] 104 8
names(possums)
[1] "site" "pop" "sex" "age" "headL" "skullW" "totalL" "tailL"
# get the subset of the possums representing the Victoria population
vic <- filter(possums, pop == "Vic")
null.mean <- 37.86 # mean under null hypothesis
vic.mean <- mean(vic$tailL)
vic.sd <- sd(vic$tailL)
min.tailL <- min(vic$tailL)
max.tailL <- max(vic$tailL)
tail.range <- seq(min.tailL, max.tailL, length.out = 100)
# histogram and corresponding normal with same mean and sd
p1 <- ggplot() +
geom_histogram(aes(x = vic$tailL, y = ..density..), bins = 9) +
geom_line(aes(x = tail.range,
y = dnorm(tail.range, mean = vic.mean, sd = vic.sd)),
color = 'red', size = 1) +
labs(x = "Tail Length", y = "Density", title = "Distribution of Victoria Possum\nTail Lengths")
# to create the normal probability plot line in ggplot we need to explicitly
# calculate the slope and intercept
observed.q1 <- quantile(vic$tailL, 0.25)
observed.q3 <- quantile(vic$tailL, 0.75)
predicted.q1 <- qnorm(0.25)
predicted.q3 <- qnorm(0.75)
slope <- (observed.q3 - observed.q1)/(predicted.q3 - predicted.q1)
intercept <- observed.q1 - slope * predicted.q1
# now generate the plot
p2 <- ggplot() +
geom_abline(slope = slope, intercept = intercept,
color = 'red', size = 1) +
geom_qq(aes(sample = vic$tailL))
plot_grid(p1, p2)
Confidence intervals provide plausible ranges of values for statistics of interest, such as the mean. They are useful because we recognize that the point estimates we calculate from any given sample will vary somewhat if we were to repeat the experiment of interest again. Confidence intervals are based on our understanding of the sampling distribution of the statistic of interest.
Recall that given a population with a mean \(\mu\) and a standard deviation \(\sigma\), the sampling distribution of the mean for samples of size \(n\) is normally distributed, \(N(\mu, \sigma/\sqrt{n})\). We refer to the standard deviation of the sampling distribution of a statistic as the standard error of that statistic. Confidence intervals are calculated in terms of multiples of the standard error of the statistic of interest. For example, because the sampling distribution of the mean is normally distributed, we know that ~95% of sample means should fall within about 2 (more precisely 1.96) standard errors of the population mean. Our knowledge of the sampling distribution thus allows us to estimate intervals with different levels of confidence.
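The 1.96 multiplier comes directly from the standard normal distribution. As a quick supplementary check (not part of the original analysis), the quantiles that bound the middle 95% of a standard normal are:
# quantiles bounding the middle 95% of a standard normal: approximately -1.96 and 1.96
qnorm(c(0.025, 0.975))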
# This is an example of generating a set of simulated samples of size n
# without resorting to for loops
n <- length(vic$tailL)
SE.vic <- vic.sd/sqrt(length(vic$tailL))
# generate 1000 samples of size n as a single vector
sim.tailL <- rnorm(1000 * n, mean = vic.mean, sd = vic.sd)
# split into individual samples of size n
sim.samples <- split(sim.tailL, cut(seq_along(sim.tailL), 1000, labels = FALSE))
# calculate mean for each simulated sample
sim.means <- sapply(sim.samples, mean)
ggplot() +
geom_histogram(aes(x = sim.means, y = ..density..), bins=25) +
geom_vline(xintercept = vic.mean, color = 'red') +
geom_vline(xintercept = vic.mean - 1.96*SE.vic, color = 'red', linetype = 'dashed') +
geom_vline(xintercept = vic.mean + 1.96*SE.vic, color = 'red', linetype = 'dashed') +
labs(title = "Simulated sampling distribution of\nmean tail length for Victoria population\nand 95% CI for mean",
x = "Mean tail length", y = "Density")
One way we can decide whether to reject the null hypothesis is to ask whether the mean under the null hypothesis falls within a reasonably wide confidence interval (e.g. a 95% CI) for the mean, as estimated from the Victoria data set. If not, then that provides some evidence to reject the null hypothesis.
SE.tailL <- vic.sd/sqrt(length(vic$tailL))
CI95.tailL <- c(vic.mean - 1.96*SE.tailL, vic.mean + 1.96*SE.tailL)
CI95.tailL
[1] 35.44 36.43
The 95% confidence interval for the mean tail length in the Victoria possum population is (35.4383, 36.4312) cm.
ggplot() +
geom_vline(xintercept = null.mean, linetype = 'dashed') +
annotate("text", x = 1.025 * null.mean, y = 1.5, label = "H0: mu = 37.86") +
geom_point(aes(x = vic.mean, y = 1), color = 'red') +
geom_line(aes(x = CI95.tailL, y = 1), color = 'red') +
annotate("text", x = vic.mean, y = 1.5, label = "Victoria mean\nand 95% CI", color = 'red') +
xlim(35,40) + ylim(0,3) + labs(x = "Tail Length (cm)", y = "")
Because the mean under the null hypothesis is very far from the 95% CI for the mean in the Victoria population, we take this as fairly strong evidence against the null hypothesis (i.e. we judge it unlikely that the Victoria sample was drawn from a population with the same mean tail length observed in the rest of Australia).
The CI approach above lets us assess, in a general way, whether there’s evidence to reject the null hypothesis. In the above example, we might state our finding along the lines of “We are 95% confident that the mean tail length of Victoria possums is between 35.44 and 36.43 cm. We therefore reject the null hypothesis that Victoria possums have the same mean tail length as other Australian possums.” We would make the same statement for any other estimated 95% CI that did not overlap 37.86. Thus the CI approach doesn’t really tell us anything about the strength of evidence against the null hypothesis; it just gives us a basis by which to reject it (or fail to reject it).
To assess the strength of the evidence against the null hypothesis, we can ask “what is the probability of observing data (or a statistic of interest) at least as favorable to the alternative hypothesis, if the null hypothesis were true?”
When considering a statistic of interest, the usual way to phrase this question is with respect to the expected sampling distribution of that statistic under the null hypothesis. For example, when testing a hypothesis involving means, we would ask “What is the probability of observing my sample mean, with respect to the expected distribution of sample means, if the null hypothesis were true?”
Let’s apply this to our possum tail example. Our null hypothesis, \(H_0\), is that the mean tail length in Victoria possums is the same as in the rest of Australia (\(\mu=37.86\)). Let’s assume the standard deviation of tail length is the same under the null and alternative hypotheses, so that we can estimate the sampling distribution of the mean under the null hypothesis using our sample standard deviation.
First we calculate the expected distribution of the sample means under the null hypothesis and plot this sampling distribution. We then draw the observed sample mean for comparison.
null.mean <- 37.86 # mean under the null hypothesis (as defined above)
null.se <- SE.tailL # because we assumed H0 and HA have same SE
H0.plot <-
ggplot(data.frame(x = c(36.5, 39)), aes(x)) +
stat_function(fun = dnorm, args = list(mean = null.mean, sd = null.se), n = 200) +
labs(x = "Mean Tail Length (cm)", y = "Density",
title = "Sampling Distrbution of Mean Tail Length\nUnder the Null Hypothesis") +
xlim(34,40)
H0.plot +
geom_segment(aes(x = vic.mean, y = 0.5,
xend = vic.mean, yend = 0),
color = 'red',
arrow = arrow(length = unit(0.5, "cm"))) +
annotate("text", x = vic.mean, y = 0.6, label = "Victoria mean", color = 'red')
It looks like our observed mean is very far to the left of the expected distribution of sample means under the null hypothesis. Let’s express that as a Z-score. Recall that Z-scores are defined in terms of the number of standard deviations an observation sits from the mean: \(z = \frac{x - \mu}{\sigma}\). In the present example, \(x = \overline{x}\), \(\mu = \mu_{H_0}\), and \(\sigma = {SE}_{H_0}\).
z <- (vic.mean - null.mean)/null.se
z
[1] -7.601
Our calculation confirms our graphical intuition. Our observed sample mean is 7.6 standard errors to the left of the mean under the null hypothesis!
Since this is a one-sided hypothesis test, with the alternative hypothesis to the left of the null hypothesis, we only need to sum the probability to the left of our observed value of the mean.
# since z is negative, the area to its left equals the upper-tail area at abs(z)
p.value.1sided <- pnorm(abs(z), lower.tail = FALSE)
p.value.1sided
[1] 1.472e-14
That is a vanishingly small probability, so we consider this strong evidence to reject the null hypothesis.
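For comparison, the same question can be addressed with base R's one-sample t-test, which uses the t distribution rather than the normal but should give an essentially identical answer for a sample of this size. A minimal sketch (a supplementary check, not part of the analysis above):
# one-sided, one-sample t-test of H0: mu = 37.86 vs HA: mu < 37.86
t.test(vic$tailL, mu = 37.86, alternative = "less")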
Let’s test another hypothesis, this time a two-sided one. We wish to test the hypothesis that the mean total body length of Victoria possums is not equal to the mean total length of other Australian possums, which previous studies have shown to be 86.8 cm.
Because in this case we have no a priori hypothesis about the direction of the difference in average body length, this is a “two-tailed” hypothesis test (i.e. we want to consider the possibility that Victoria possums have either shorter or longer bodies than other Australian possums).
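In symbols:

\[
H_0: \mu = 86.8 \text{ cm} \qquad H_A: \mu \neq 86.8 \text{ cm}
\]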
First a histogram:
qplot(vic$totalL, bins = 9)
Normal probability plot
qqnorm(vic$totalL)
qqline(vic$totalL)
Our histogram and probability plot raise some questions about the assumption of approximate normality. However, we have a decently large sample size (\(n = 46\)) and we’re testing a hypothesis about the mean (which tends to be fairly robust to departures from normality), so we decide to proceed (cautiously) under the normal assumption.
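If we wanted a more formal check (a supplementary step, not part of the original analysis), base R's Shapiro-Wilk test gives a quantitative test of the normality assumption:
# Shapiro-Wilk test of normality for total length in the Victoria sample
# (small p-values indicate departure from normality)
shapiro.test(vic$totalL)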
null.totalL <- 86.8
mean.totalL <- mean(vic$totalL)
sd.totalL <- sd(vic$totalL)
SE.totalL <- sd.totalL/sqrt(length(vic$totalL))
CI95.totalL <- c(mean.totalL - 1.96*SE.totalL, mean.totalL + 1.96*SE.totalL)
CI95.totalL
[1] 86.07 88.87
ggplot() +
geom_vline(xintercept = null.totalL, linetype = 'dashed') +
annotate("text", x = 1.01 * null.totalL, y = 2.5, label = "H0: mu = 86.8") +
geom_point(aes(x = mean.totalL, y = 1), color = 'red') +
geom_line(aes(x = CI95.totalL, y = 1), color = 'red') +
annotate("text", x = mean.totalL, y = 1.5, label = "Victoria mean total length\nand 95% CI", color = 'red') +
xlim(85,90) + ylim(0,3) + labs(x = "Total Length (cm)", y = "")
H0.totalL <-
ggplot(data.frame(x = c(84, 90)), aes(x)) +
stat_function(fun = dnorm, args = list(mean = null.totalL, sd = SE.totalL), n = 200) +
labs(x = "Mean Total Length (cm)", y = "Density",
title = "Sampling Distrbution of Mean Total Length\nUnder the Null Hypothesis") +
xlim(84,90)
H0.totalL +
geom_segment(aes(x = mean.totalL, y = 0.2,
xend = mean.totalL, yend = 0),
color = 'red',
arrow = arrow(length = unit(0.5, "cm"))) +
annotate("text", x = mean.totalL, y = 0.3, label = "Victoria mean\ntotal length", color = 'red')
z.totalL <- (mean.totalL - null.totalL)/SE.totalL
z.totalL
[1] 0.9341
# x values and corresponding null sampling distribution density, for plotting
x.totalL <- seq(84, 90, length.out = 200)
density.totalL <- dnorm(x.totalL, mean = null.totalL, sd = SE.totalL)
# tail regions at least as far from the null mean as the observed mean
left.area <- seq(84, null.totalL - z.totalL * SE.totalL, length.out = 100)
right.area <- seq(null.totalL + z.totalL * SE.totalL, 90, length.out = 100)
ggplot() +
geom_line(aes(x = x.totalL, y = density.totalL)) +
geom_area(aes(x = left.area, y = dnorm(left.area, mean = null.totalL, sd = SE.totalL)),
fill = "gray") +
geom_area(aes(x = right.area, y = dnorm(right.area, mean = null.totalL, sd = SE.totalL)),
fill = "gray")
p.value.2sided <- 2 * pnorm(abs(z.totalL), lower.tail = FALSE)
p.value.2sided
[1] 0.3503
This p-value is large, so we do not have evidence against the null hypothesis; we fail to reject the hypothesis that the mean total body length of Victoria possums equals that of other Australian possums.
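As with the tail-length example, base R's one-sample t-test offers a convenient cross-check (a supplementary sketch, not part of the analysis above); it should give a p-value close to the z-based one computed here:
# two-sided, one-sample t-test of H0: mu = 86.8
t.test(vic$totalL, mu = 86.8, alternative = "two.sided")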
In standard statistical hypothesis testing there are two real possibilities – the null hypothesis is true or the alternative hypothesis is true. When you carry out a hypothesis test, there are two possible test outcomes – you reject the null hypothesis or you fail to reject the null hypothesis. It is typical to represent the different combinations of reality and test outcome in a table like the following:
|  | do not reject \(H_0\) | reject \(H_0\) |
|---|---|---|
| \(H_0\) true | okay | Type I error (false positive), \(\alpha\) |
| \(H_A\) true | Type II error (false negative), \(\beta\) | okay |
When we specify a significance threshold, \(\alpha\), for hypothesis testing, this controls the Type I error rate (false positive rate) of our test. The false negative rate is often referred to as \(\beta\). In general, there is a tradeoff between the false positive and false negative rates – the lower the false positive rate, the higher the false negative rate, and vice versa.
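As a rough illustration of this tradeoff (a sketch using made-up numbers, not drawn from the possum analysis: the effect size, sample size, and standard deviation below are arbitrary), we can simulate many experiments under each hypothesis and record how often a one-sided z-test at threshold \(\alpha\) rejects:
# supplementary sketch: estimating Type I and Type II error rates by simulation
# (hypothetical effect size, sample size, and sd, chosen only for illustration)
set.seed(20)
n.sim <- 10000             # number of simulated experiments
n.obs <- 30                # hypothetical sample size per experiment
mu0 <- 0                   # mean under the null hypothesis
muA <- 0.4                 # hypothetical mean under the alternative
sigma <- 1                 # hypothetical standard deviation
alpha <- 0.05              # significance threshold

SE <- sigma/sqrt(n.obs)
z.crit <- qnorm(1 - alpha) # critical value for an upper-tail z-test

# sample means simulated under H0 and under HA
means.H0 <- replicate(n.sim, mean(rnorm(n.obs, mean = mu0, sd = sigma)))
means.HA <- replicate(n.sim, mean(rnorm(n.obs, mean = muA, sd = sigma)))

# estimated Type I error rate: fraction of H0 samples that get (wrongly) rejected
mean((means.H0 - mu0)/SE > z.crit)
# estimated Type II error rate: fraction of HA samples that (wrongly) fail to be rejected
mean((means.HA - mu0)/SE <= z.crit)
Rerunning the sketch with a smaller \(\alpha\) (say 0.01) lowers the estimated Type I error rate but raises the estimated Type II error rate, and vice versa.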