Alpha diversity descripes the diversity within a sample. Beta diversity, on the other hand, descriptes the diversity between samples. This implies that beta diversity measures can not be calculated for a single sample, but is calculated for all pairs of samples in the dataset.

So, if you have 10 samples, then there is \(9 + 8 + .... + 1 = 45\) beta diversity measures!

Each beta diversity measure measures the similarity between the corresponding two samples. That is a distance in microbiome space. I.e. is \(\beta diversity_{i,j} = 0\) for sample \(i\) and \(j\) then the two samples are exactly similar. If \(\beta diversity_{i,j} >> 0\) is large, then the two samples are very distinct.

Beta diversity measures are by definition non-negative numbers.

While this can seem rather complex, combining beta diversity measures with dimentionallity reduction models such as Principal Component Analysis, comprise a very powerful tool for visualizing similarities and sisimilarities between samples.

From OTU table to Distance table

Given an OTU table with \(n\) samples and \(p\) variables (OTUs): \(\mathbf{C}\) the first task is to calculate all \(n(n-1)/2\) pairwise distances and represent those in a symmetric \(n\) by \(n\) matrix \(\mathbf{D}\).

There are several different ways to calculate the beta diversity. Check out the distanceMethodList to see those supported by phyloseq.

These can in general be grouped according to two criterions:

A thorough description can be found else where. Here we are going to focus on four common methods, namely Jaccard, Bray Curtis, weighted- and unweighted UNIFRAC, as these methods span the criterions mentioned above.

Let \(a\) and \(b\) be count vectors for two samples with equally many features (OTUs)

Bray Curtis vs Jaccard

The Jaccard’s index solely focus on presence/absence and hence do not distinguis a OTU on its abundance. On the other hand Bray Curtis focuses on abundance.

Jaccard

Jaccard’s index is defined as follows

\[S_{a,b} = \frac{a>0 \cap b>0}{a>0 \cup b>0} = \frac{\#common OTUs}{\#total OTUs}\]

Bray Curtis

Bray Curtis dissimilarity index relates to the so-called Manhatten distance, and measures, for each OTU the descripancy in abundance. Normalized, and summed

\[BC_{a,b} = \frac{\sum |a_i - b_i |}{\sum (a_i + b_i)} \] for \(i = 1,...,p\)

Two observations:

For OTU’s absent in both samples \(|a_i - b_i | = 0\) which implies that BC only depends on observed OTUs. BC is insensitive towards total sum normalizion, as this is an inherent part of the denominator.

UNIFRAC

Both, Jaccard and Bray Curtis makes a flat comparison of the samples, meaning, that any two different OTUs are alike. This, however we know is not the case, as OTUs belonging to the same taxonomic group (say Family) are more similar than two belonging to different groups. Utilizing this actively in the computation of similarities have some advances, as it removes uncertainty due to somewhat wrong annotation, reduces emphasis on differences between phylogenetic similar OTUs and enhances differences due to diffences on a diverse phylogenetic level.

In line with Jaccard / Bray Curtis, Unifrac can be computed focusing on presence/absence as well as abundance. These are called unweigted and weighted unifrac respectively.

From Distance table to PCoA plot

Beta diversity measures has a similar structure as correlation / covariance matrices. Such matrices can be projected into a low dimensionality representation which can be visualized, showing the largest common variation from complex data in a simple plot. These plots are often refered to as ordination plots. For theory on such dimensionality reduction look up Principal Component Analysis (PCA), eigenvalue/eigenvector decomposition and multidimensional scaling.

The function ordinate() and plot_ordination() from phyloseq are very useful for beta diversity plotting of microbiome data.

library(phyloseq)
library(ggplot2)
load('./data/Rats_inulin.RData')
ordBC <- ordinate(phyX, "PCoA", "bray")
plot_ordination(phyX,ordBC, color = 'Description') + 
  stat_ellipse() + 
  facet_wrap(~time) + 
  theme(legend.position = 'bottom')

The plot_ordination() is in synch with ggplot2, why ggplot2 functionality adapts nicely.

The interpretation of ordination plots are simple: Two points in close proximity are alike, while those distant are not alike.

1 Exercise

1.1 Four different diversity measures

Compute all four diversity measures directly on the Rat dataset. You might want to include some preprocessing.

library(phyloseq)
load('./data/Rats_inulin.RData'); 

ordBC <- ordinate(phyX, "PCoA", "bray")
ordJC <- ordinate(phyX, "PCoA", "jaccard")
ordUF <- ordinate(phyX, "PCoA", "unifrac")
ordwUF <- ordinate(phyX, "PCoA", "wunifrac")

smpID <- sample_data(phyX)$X.SampleID

df <- rbind(data.frame(ordBC$vectors[,1:4],X.SampleID = smpID, method = 'BC'),
            data.frame(ordJC$vectors[,1:4],X.SampleID = smpID,method = 'Jaccard'),
            data.frame(ordUF$vectors[,1:4],X.SampleID = smpID,method = 'unifrac'),
            data.frame(ordwUF$vectors[,1:4],X.SampleID = smpID,method = 'wunifrac'))

# add sample_data info
df <- merge(df, data.frame(sample_data(phyX)), by = 'X.SampleID')


g1 <- ggplot(data = df, aes(Axis.1,Axis.2, 
                      color = factor(Description),
                      shape = time, 
                      group = factor(Description):factor(time))) + 
  geom_point() + 
  stat_ellipse() + 
  facet_wrap(~method,scales = 'free')

g1 +geom_hline(yintercept = 0) + geom_vline(xintercept = 0)  + theme_minimal()

… If you want to save the (last produced) plot

ggsave('ThisIsFigure2formanus.pdf')

1.2 Sensitive towards normalization

Based on the formulation, which of the methods are sensitive towards total sum normalization?

1.3 Compare methods

Compare the different similarity methods by vectorizing and correlating the distance objects.

D_BC <- phyloseq::distance(phyX, "bray")
D_JC <- phyloseq::distance(phyX, "jaccard")
D_UF <- phyloseq::distance(phyX,  "unifrac")
D_wUF <- phyloseq::distance(phyX, "wunifrac")
dist_df <- data.frame(bray = as.vector(D_BC), 
           jaccard = as.vector(D_JC), 
           unifrac = as.vector(D_UF), 
           wunifrac = as.vector(D_wUF))

cor(dist_df)

1.4 Plot it!! with design

The experiment is longitudinal with before and after diet (sausage) intervention. Try to infer this on the PCoA plot.

ggplot(data = df, aes(Axis.1,Axis.2, 
                      color = factor(Description),
                      shape = time, 
                      group = ID)) + 
  geom_point() +
  geom_line() + 
  stat_ellipse(aes(group = factor(Description):factor(time))) + 
  facet_wrap(~method,scales = 'free')

1.5 Within subject variation

Are there any tendencies towards each rat being more similar with it self over time?

1.6 Cowboy linear models

Try to compute a normal linear anova model with the component axis (1,2,..) as response dependent on time and rat ID.

mdl <- lm(data = df[df$method=='Jaccard',],Axis.1~ID + time*Description)
anova(mdl)

1.6.1 The tidyverse way

There is a new R-modelling paragdigme making it easy to do many things at once. If you find the coding trivial, it might be worth looking into, as it - after some headache - makes life easier. However, it is not a microbiome-thing, only R, and you can achieve the same with the more basic commands. Below is shown how the linear models from the chunk above is computed for all methods using the tidyverse and broom packages.

library(tidyverse)
library(broom)
df %>% 
  group_by(method) %>% 
  do(lm(data = ., Axis.1~ID + time*Description) %>% anova %>% tidy)
## # A tibble: 16 × 7
## # Groups:   method [4]
##    method   term                df  sumsq   meansq statistic   p.value
##    <chr>    <chr>            <int>  <dbl>    <dbl>     <dbl>     <dbl>
##  1 BC       ID                  29 0.519  0.0179        3.52  8.74e- 4
##  2 BC       time                 1 3.07   3.07        603.    1.61e-19
##  3 BC       time:Description     2 0.578  0.289        56.8   3.22e-10
##  4 BC       Residuals           26 0.132  0.00508      NA    NA       
##  5 Jaccard  ID                  29 0.471  0.0162        2.65  7.06e- 3
##  6 Jaccard  time                 1 3.01   3.01        492.    2.03e-18
##  7 Jaccard  time:Description     2 0.538  0.269        43.9   4.59e- 9
##  8 Jaccard  Residuals           26 0.159  0.00613      NA    NA       
##  9 unifrac  ID                  29 0.191  0.00657       3.48  9.62e- 4
## 10 unifrac  time                 1 1.13   1.13        597.    1.84e-19
## 11 unifrac  time:Description     2 0.159  0.0794       42.0   7.13e- 9
## 12 unifrac  Residuals           26 0.0491 0.00189      NA    NA       
## 13 wunifrac ID                  29 0.0552 0.00190       2.75  5.46e- 3
## 14 wunifrac time                 1 0.0976 0.0976      141.    5.29e-12
## 15 wunifrac time:Description     2 0.0217 0.0109       15.7   3.40e- 5
## 16 wunifrac Residuals           26 0.0180 0.000692     NA    NA

This is kinda add hoc, as the composition is captured in not-only a single component. I.e. instead of using Axis.1 we would like to use all the Axes,… or the entire distance matrix. This is exactly the purpose of permanova. See more under Statistical inference -> ANOVA on beta diversity.

References

Chen, Jun, Kyle Bittinger, Emily S. Charlson, Christian Hoffmann, James Lewis, Gary D. Wu, Ronald G. Collman, Frederic D. Bushman, and Hongzhe Li. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28, no. 16 (2012): 2106-2113.

Schmidt, Thomas Sebastian Benedikt, João Frederico Matias Rodrigues, and Christian Von Mering. A family of interaction-adjusted indices of community similarity The ISME journal 11, no. 3 (2017): 791.

Lozupone, Catherine, and Rob Knight. UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 71, no. 12 (2005): 8228-8235.

Xia, Yinglin, Jun Sun, and Ding-Geng Chen. Statistical analysis of microbiome data with R. Springer, 2018. Chapter 6.