Alpha diversity descripes the diversity within a sample. Beta diversity, on the other hand, descriptes the diversity between samples. This implies that beta diversity measures can not be calculated for a single sample, but is calculated for all pairs of samples in the dataset.
So, if you have 10 samples, then there is \(9 + 8 + .... + 1 = 45\) beta diversity measures!
Each beta diversity measure measures the similarity between the corresponding two samples. That is a distance in microbiome space. I.e. is \(\beta diversity_{i,j} = 0\) for sample \(i\) and \(j\) then the two samples are exactly similar. If \(\beta diversity_{i,j} >> 0\) is large, then the two samples are very distinct.
Beta diversity measures are by definition non-negative numbers.
While this can seem rather complex, combining beta diversity measures with dimentionallity reduction models such as Principal Component Analysis, comprise a very powerful tool for visualizing similarities and sisimilarities between samples.
Given an OTU table with \(n\) samples and \(p\) variables (OTUs): \(\mathbf{C}\) the first task is to calculate all \(n(n-1)/2\) pairwise distances and represent those in a symmetric \(n\) by \(n\) matrix \(\mathbf{D}\).
There are several different ways to calculate the beta diversity. Check out the distanceMethodList to see those supported by phyloseq.
These can in general be grouped according to two criterions:
Sensitivity towards abundance or presence absense.
Incorporation of phylogentic similarity.
A thorough description can be found else where. Here we are going to focus on four common methods, namely Jaccard, Bray Curtis, weighted- and unweighted UNIFRAC, as these methods span the criterions mentioned above.
Let \(a\) and \(b\) be count vectors for two samples with equally many features (OTUs)
The Jaccard’s index solely focus on presence/absence and hence do not distinguis a OTU on its abundance. On the other hand Bray Curtis focuses on abundance.
Jaccard’s index is defined as follows
\[S_{a,b} = \frac{a>0 \cap b>0}{a>0 \cup b>0} = \frac{\#common OTUs}{\#total OTUs}\]
Bray Curtis dissimilarity index relates to the so-called Manhatten distance, and measures, for each OTU the descripancy in abundance. Normalized, and summed
\[BC_{a,b} = \frac{\sum |a_i - b_i |}{\sum (a_i + b_i)} \] for \(i = 1,...,p\)
Two observations:
For OTU’s absent in both samples \(|a_i - b_i | = 0\) which implies that BC only depends on observed OTUs. BC is insensitive towards total sum normalizion, as this is an inherent part of the denominator.
Both, Jaccard and Bray Curtis makes a flat comparison of the samples, meaning, that any two different OTUs are alike. This, however we know is not the case, as OTUs belonging to the same taxonomic group (say Family) are more similar than two belonging to different groups. Utilizing this actively in the computation of similarities have some advances, as it removes uncertainty due to somewhat wrong annotation, reduces emphasis on differences between phylogenetic similar OTUs and enhances differences due to diffences on a diverse phylogenetic level.
In line with Jaccard / Bray Curtis, Unifrac can be computed focusing on presence/absence as well as abundance. These are called unweigted and weighted unifrac respectively.
Beta diversity measures has a similar structure as correlation / covariance matrices. Such matrices can be projected into a low dimensionality representation which can be visualized, showing the largest common variation from complex data in a simple plot. These plots are often refered to as ordination plots. For theory on such dimensionality reduction look up Principal Component Analysis (PCA), eigenvalue/eigenvector decomposition and multidimensional scaling.
The function ordinate() and plot_ordination() from phyloseq are very useful for beta diversity plotting of microbiome data.
library(phyloseq)
library(ggplot2)
load('./data/Rats_inulin.RData')
ordBC <- ordinate(phyX, "PCoA", "bray")
plot_ordination(phyX,ordBC, color = 'Description') +
stat_ellipse() +
facet_wrap(~time) +
theme(legend.position = 'bottom')
The plot_ordination() is in synch with ggplot2, why ggplot2 functionality adapts nicely.
The interpretation of ordination plots are simple: Two points in close proximity are alike, while those distant are not alike.
Compute all four diversity measures directly on the Rat dataset. You might want to include some preprocessing.
library(phyloseq)
load('./data/Rats_inulin.RData');
ordBC <- ordinate(phyX, "PCoA", "bray")
ordJC <- ordinate(phyX, "PCoA", "jaccard")
ordUF <- ordinate(phyX, "PCoA", "unifrac")
ordwUF <- ordinate(phyX, "PCoA", "wunifrac")
smpID <- sample_data(phyX)$X.SampleID
df <- rbind(data.frame(ordBC$vectors[,1:4],X.SampleID = smpID, method = 'BC'),
data.frame(ordJC$vectors[,1:4],X.SampleID = smpID,method = 'Jaccard'),
data.frame(ordUF$vectors[,1:4],X.SampleID = smpID,method = 'unifrac'),
data.frame(ordwUF$vectors[,1:4],X.SampleID = smpID,method = 'wunifrac'))
# add sample_data info
df <- merge(df, data.frame(sample_data(phyX)), by = 'X.SampleID')
g1 <- ggplot(data = df, aes(Axis.1,Axis.2,
color = factor(Description),
shape = time,
group = factor(Description):factor(time))) +
geom_point() +
stat_ellipse() +
facet_wrap(~method,scales = 'free')
g1 +geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_minimal()
… If you want to save the (last produced) plot
ggsave('ThisIsFigure2formanus.pdf')
Based on the formulation, which of the methods are sensitive towards total sum normalization?
Compare the different similarity methods by vectorizing and correlating the distance objects.
D_BC <- phyloseq::distance(phyX, "bray")
D_JC <- phyloseq::distance(phyX, "jaccard")
D_UF <- phyloseq::distance(phyX, "unifrac")
D_wUF <- phyloseq::distance(phyX, "wunifrac")
dist_df <- data.frame(bray = as.vector(D_BC),
jaccard = as.vector(D_JC),
unifrac = as.vector(D_UF),
wunifrac = as.vector(D_wUF))
cor(dist_df)
The experiment is longitudinal with before and after diet (sausage) intervention. Try to infer this on the PCoA plot.
ggplot(data = df, aes(Axis.1,Axis.2,
color = factor(Description),
shape = time,
group = ID)) +
geom_point() +
geom_line() +
stat_ellipse(aes(group = factor(Description):factor(time))) +
facet_wrap(~method,scales = 'free')
Are there any tendencies towards each rat being more similar with it self over time?
Try to compute a normal linear anova model with the component axis (1,2,..) as response dependent on time and rat ID.
mdl <- lm(data = df[df$method=='Jaccard',],Axis.1~ID + time*Description)
anova(mdl)
There is a new R-modelling paragdigme making it easy to do many things at once. If you find the coding trivial, it might be worth looking into, as it - after some headache - makes life easier. However, it is not a microbiome-thing, only R, and you can achieve the same with the more basic commands. Below is shown how the linear models from the chunk above is computed for all methods using the tidyverse and broom packages.
library(tidyverse)
library(broom)
df %>%
group_by(method) %>%
do(lm(data = ., Axis.1~ID + time*Description) %>% anova %>% tidy)
## # A tibble: 16 × 7
## # Groups: method [4]
## method term df sumsq meansq statistic p.value
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 BC ID 29 0.519 0.0179 3.52 8.74e- 4
## 2 BC time 1 3.07 3.07 603. 1.61e-19
## 3 BC time:Description 2 0.578 0.289 56.8 3.22e-10
## 4 BC Residuals 26 0.132 0.00508 NA NA
## 5 Jaccard ID 29 0.471 0.0162 2.65 7.06e- 3
## 6 Jaccard time 1 3.01 3.01 492. 2.03e-18
## 7 Jaccard time:Description 2 0.538 0.269 43.9 4.59e- 9
## 8 Jaccard Residuals 26 0.159 0.00613 NA NA
## 9 unifrac ID 29 0.191 0.00657 3.48 9.62e- 4
## 10 unifrac time 1 1.13 1.13 597. 1.84e-19
## 11 unifrac time:Description 2 0.159 0.0794 42.0 7.13e- 9
## 12 unifrac Residuals 26 0.0491 0.00189 NA NA
## 13 wunifrac ID 29 0.0552 0.00190 2.75 5.46e- 3
## 14 wunifrac time 1 0.0976 0.0976 141. 5.29e-12
## 15 wunifrac time:Description 2 0.0217 0.0109 15.7 3.40e- 5
## 16 wunifrac Residuals 26 0.0180 0.000692 NA NA
This is kinda add hoc, as the composition is captured in not-only a single component. I.e. instead of using Axis.1 we would like to use all the Axes,… or the entire distance matrix. This is exactly the purpose of permanova. See more under Statistical inference -> ANOVA on beta diversity.
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. Statistical analysis of microbiome data with R. Springer, 2018. Chapter 6.