Introduction
Example data set: Iris
Scatter plots
- Setting aspect ratios
- Marginal histograms and density plots with ggExtra
Bivariate density plot
Combining Scatter Plots and Density Plots with Categorical Information
Facet plots in ggplot
Measure of Association: Covariance and Correlation

Introduction

Having already explored various ways to visualize the distributions of single variables, we now turn our attention to bivariate and multivariate representations.

library(dplyr)
library(ggplot2)

Example data set: Iris

Initially we’ll use just the I. setosa specimens from the Iris data set for our visualizations.

setosa <- filter(iris, Species == "setosa")

Scatter plots

A scatter plot is one of the simplest representations of a bivariate distribution.

Scatter plots are simple to create in ggplot simply by specifying the appropriate X and Y variables in the aesthetic mapping and using geom_point for the geometric mapping.

ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

I’m not crazy about the defaults in ggplot, so let’s tweak our plot a little:

setosa.length.vs.width <- ggplot(setosa, aes(x = Sepal.Length, y = Sepal.Width))

clean.theme <- theme_bw() +                 # removes the background
  theme(panel.grid.major = element_blank()) # remove the heavy grid lines

labels <- labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)")

setosa.length.vs.width + 
  geom_point(alpha = 0.25, size = 2) + # tweak alpha and size of points
  labels + clean.theme

You’ll recall that there was a fair amount of overlap between the observations – you can tell this is the case because when we set the alpha transparency in the previous figure some of the plotted points look darker than other. We can substitute geom_jitter for geom_point if we wish to add jitter to the points in our scatter plot.

setosa.length.vs.width + 
  geom_jitter(alpha = 0.25, size = 2,           # tweak shape and size
              width = 0.05, height = 0.05) +    # set degree of jitter
  labels + clean.theme

Setting aspect ratios

When the X and Y variables of a scatter plot are measured on the same scale and have comparable variance it’s sometimes useful to draw them plot with an aspect ratio of 1, meaning that a unit distance in the x-direction is the same as unit distance in the y-direction.

setosa.length.vs.width + geom_point(alpha = 0.25, size = 2) + labels + clean.theme +
  theme(aspect.ratio = 1) +       # set aspect ratio
  xlim(3.5,6.5) + ylim(2,5)       # set limits for x- and y- axes

Marginal histograms and density plots with ggExtra

The ggExtra package provides some useful additional functions for working with ggplot. We’ll look at a function called ggMarginal that provides a convenient way to plot histograms or density plots on the margin of a scatter plot. First you’ll need to install ggExtra using the RStudio GUI or the command install.packags("ggExtra").

library(ggExtra)

p <- setosa.length.vs.width + geom_point(alpha = 0.25, size = 2) + 
  labels + clean.theme + theme(aspect.ratio = 1) 

ggMarginal(p, type = "histogram", bins = 9)

ggMarginal can also generate marginal density plots and boxplot.

ggMarginal(p, type = "density")

Bivariate density plot

The density plot, which we introduced as a visualization for univariate data, can also be depicted in two-dimensions. It 2D density plot has the same interpretation as in 1D, but not with respect to a bivariate distribution. The contours of the 2D density plot indicate contours of estimated equal density.

setosa.length.vs.width + geom_density2d() + labels + clean.theme

We can make the relationship between the 2D density plot and the scatter plot can be made clearer if we combine the two:

setosa.length.vs.width + geom_density_2d() + geom_jitter(alpha=0.35) +
  labels + clean.theme + theme(aspect.ratio = 1)

Combining Scatter Plots and Density Plots with Categorical Information

As with many of the univariate visualizations we explored, it is often useful to depict bivariate relationships as we change a categorical variable. Here are some examples of this:

all.length.vs.width <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))


all.length.vs.width + 
  geom_point(aes(color = Species, shape = Species), size = 2, alpha = 0.6) +
  labels

Notice how in our aesthetic mapping we specified that both color and shape should be used to represent the Species categories.

The same thing can be accomplished with a 2D density plot.

all.length.vs.width + 
  geom_density_2d(aes(color = Species)) +
  labels

Facet plots in ggplot

As you can see, in the density plots above, when you have multiple categorical variables and there is significant overlap in the range of each sub-distribution, figures can become quite busy. One way to deal with this is to create a separate subplot for each distribution. ggplot provides a mechanism for doing this called a facet_grid.

all.length.vs.width + geom_point(aes(color = Species)) + 
  clean.theme + theme(aspect.ratio = 1) + labels + 
  facet_grid(. ~ Species)

The first argument to facet_grid is a “formula” that specifies what is drawn with respect to the rows and columns of the grid. The formula can be generalized as LHS ~ RHS (LHS = left hand side, RHS = right hand side) where the LHS gives the rows and the RHS give the columns. A dot in the formula indicates no faceting in that dimension.

Here’s a combination of scatter plots and 2D density plots, embedded in a facet grid.

all.length.vs.width + 
  geom_density_2d(aes(color = Species), alpha = 0.5) + 
  geom_point(aes(color = Species), alpha=0.5, size=1) + 
  facet_grid(. ~ Species) +
  clean.theme + labels + 
  theme(aspect.ratio = 1, legend.position = "none") + 
  labs(title = "Iris Sepal Morphology")

Measure of Association: Covariance and Correlation

A bivariate scatter plot naturally prompts us to ask if there is some relationship between the variables being plotted. The simplest type of relationships we consider in statistics are linear relationships. Our standard measure of linear association are covariance and correlation.

Covariance

Let \(X = \left\{x_1, x_2,\ldots, x_n\right\}\) and \(Y= \left\{y_1, y_2,\ldots, y_n\right\}\) represent two variables, each measured on the same set of \(n\) observations (individuals) in our sample.

The population covariance is defined as: \[ \sigma_{XY} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{X})(y_i - \bar{Y}) \]

The unbiased sample covariance is defined as: \[ s_{XY} = \frac{1}{n - 1}\sum_{i=1}^n(x_i - \bar{X})(y_i - \bar{Y}) \]

Covariance is a symmetric measure, i.e. cov(X,Y) = cov(Y,X). Covariances are positive when there is a positive relationship between \(X\) and \(Y\) (i.e. when \(X\) increases, \(Y\), on average, increases as well). Covariance are negative when \(X\) and \(Y\) exhibit a negative relationship (i.e. an increase in \(X\) tends to be associated with a decrease in \(Y\)).

The units of covariance are the product of the units of \(X\) and \(Y\). As was the case with variance, it can be hard to directly interpret covariances. A more intuitive measure of association is correlation.

Calculating covariance in R

The cov function in R will calculate the sample covariance between two vectors or vecotr-like objects:

cov(setosa$Sepal.Length, setosa$Sepal.Width)

[1] 0.09921633

If the first argument to cov is a data frame, the function will calculate covariances between all pairs of variables, return the result as a matrix. For this to work, the variables must all be numeric.

setosa.no.species <- select(setosa, -Species) # select all columns except Species
cov(setosa.no.species)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length   0.12424898 0.099216327  0.016355102 0.010330612
Sepal.Width    0.09921633 0.143689796  0.011697959 0.009297959
Petal.Length   0.01635510 0.011697959  0.030159184 0.006069388
Petal.Width    0.01033061 0.009297959  0.006069388 0.011106122

The diagonal elements of the covariance matrix give the variances of each variable. The off-diagonal elements give the corresponding covariances between each pair of variables. Note that the covariance matrix is symmetrical about the diagonal.

Correlation

The correlation between two variables, \(X\) and \(Y\), can be defined in terms of covariance and standard deviations.

The population correlation is given by: \[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y} \]

The unbiased sample correlation is given by: \[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]

Correlation is a unitless statistic that can take on values between -1 and 1. Correlations near zero indicate no evidence of linear association. Correlations near 1 (-1) indicate strong positive(negative) lineaer association.

Correlation in R

The cor function in R will calculate the sample covariance between two vectors or vecotr-like objects:

cor(setosa$Sepal.Length, setosa$Sepal.Width)

[1] 0.7425467

As with cov, the cor function can calculate a correlation matrix, giving all pairwise correlations between a set of variables:

cor(setosa.no.species)

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000

In a correlation matrix, the diagonal elements are always 1 (every variable is perfectly correlated with itself). Like covariance, correlation matrices are symmetric about the diagonal.

Pairs plots to depict multiple pairwise relationships

A very useful visualization of multiple bivariate relationships is a “pairs plot” in which all pairwise relationships between variables of interest are displayed. The package GGally includes a convenient implementation of the pairs plot. Install GGally (install.packages("GGally")) and then generate a pairs plot as shown below:

library(GGally)
ggpairs(setosa.no.species)

You can think of a pairs plot as similar to a correlation matrix. By default the ggpairs function will display univariate density plots along the diagonal, depicting the univariate distributions of each individual variable. The lower diagonal of the plot gives scatter plots, and the upper diagonal gives the corresponding correlations between each pair of variables.

For more info on the GGally ggpairs function, see the ggpairs vignettte.

Heat maps for representing pairwise relationships

When the number of variables is large, a pairs plot can become unreadable. An alternative is to create a “heat map” which depicts pairwise correlations in terms of colors. Very large matrices can be depicted relatively efficiently this way. The GGally package provides a function, ggcorr for creating such plots.

ggcorr(setosa.no.species, 
       size = 3,     # size argument changes text size along diagonal
       label = TRUE) # label adds text giving correlations to cells

For more info on the GGally ggcorr function, see the ggcorr vignettte.

Visualizing and Describing Bivariate Distributions

Paul M. Magwene

September 19, 2016