16S rRNA gene amplicon sequencing preliminary analysis, with differential abundance testing, report for Project ID: Example_001
CoMS (analyst: Garrett Smith, smith.10284)
Project Description
Objective
Standard analysis of 16S rRNA gene amplicon sequencing done by the Center of Microbiome Sciences (CoMS) at OSU.
Workflow overview
This analysis begins after a successful read QC, ASV identification and taxonomic assignment of 16S rRNA gene amplicon reads.
ASVs, Amplicon Sequence Variants, are the new OTUs. The standard analysis of 16S sequencing, the generaion OTUs, clusters of sequences with shared identity (e.g., ~97% nucleotide identity typically) around a centroid sequence and largely ignorant of the likely greater distance than set between members of the cluster, is falling out of favor. In it's place ASVs are emerging, with sequencing and analysis has improved such that errors in the reads can be modeled and ultimately used to get the exact sequences, and therefore has no need for clustering to encompass slighlty erroneous sequences. No method is yet robust for handling microbes with multiple 16S sequences at scale and without the this information derived from genomic references, which 16S is used to work around.
DADA2 ( https://benjjneb.github.io/dada2/bigdata_paired.html) was run on reads following a modified big data workflow, using a one-by-one pseudo-pooling approach for ASV inference and chimera detection, with the following parameters (anything not listed was left at defaults):
filter and trim parameter | forward reads | reverse reads |
---|---|---|
maxN | 0 | 0 |
maxEE | 2 | 3 |
trimleft | 17 | 21 |
truncLen | 270 | 250 |
Taxonomy was determined using SILVA's non-redundant at 99%, with species, database constructed for DADA2. ASVs were output as their sequences, a table containing both the taxonomy string and counts, and a summary of the reads at each step in processing. In this report, we focus on the ASV counts table and QC summary files.
First, data are massaged. Some ASVs and their reads removed from downstream analyses due to either lacking Phylum-level taxonomic inference, affiliation with plastids (chloroplast, mitochondria), extremely low abundances, or identification as a likely contaminant using the decontam R package (https://benjjneb.github.io/decontam/vignettes/decontam_intro.html). When possible, decontam is used in "combined" mode that runs an additional statistical test for both the frequency and prevalence methods, otherwise we use whatever is possible for frequency (using read counts correlated to quantification values) or prevalence (using presence/absence in control vs real samples). If experimental design is insufficient for decontam, any ASV in a control sample is assumed to be a contaminant. After removal of these contaminants, counts are converted into relative abundances (counts per ASV per sample / total counts for all ASVs per sample), and additional tables are written.
Second, these massaged data are visualized. These visualizations begin with an overview of the reads lost during QC/processing, the removal of contaminants is overviewed, the relationship between reads and ASV recovery (there generally should not be one), and the taxonomic assignment confidence. These QC/processing visualizations are followed by the 'core' microbiome analyses: alpha-diversity (various indices, only ASVs), beta-diversity (NMDS) with structuring members at each taxonomic level, compositional stacked bar charts (and bubble plots, if desired) at each taxonomic level, ranked-abundance charts of the top 10 members at each taxonomic level, and lastly differential abundance testing using ANCOM-BC (https://www.bioconductor.org/packages/release/bioc/vignettes/ANCOMBC/inst/doc/ANCOMBC.html) at each taxonomic level. For all taxonomic-level analyses, the read counts/relative abundances and number of ASVs are aggregated without the unassigned members. Unassigned members are excluded because, for example, 100 ASVs with an unassigned species may represent anywhere between 0 and 100 species distinct from the other species of the same genus, and therefore could create an artificial super-taxon.
Lastly, tables are provided for easy perusal of the top 10 and differentially abundant members at each taxonomic level.
Accessing the data
All data resulting from DADA2, as input here, are stored in a dedicated location as:
- ASV_decontam.tsv : p-values for each test for ASVs identified as contaminants using the R package decontam.
- ASV_relabun_nocontam.tsv : Table of relateive abundances per ASV per sample that were not identified as contaminants (each column of sample should sum to 1).
- ASV_counts_nocontam.tsv : Table of read counts per ASV per sample that were not identified as contaminants.
- ASVs_nocontam.txt : A list of ASVs that were not identified as contaminants.
- ASV_summary_nocontam.tsv : Summary of read counts during QC and processing with new column that includes reads to ASVs not identified as contaminants
- alpha.tsv: alpha diversity indices as a table
- alpha.pdf: alpha diversity values plot
- richness.pdf: richness values plot
- beta.pdf: beta diversity plot
- composition_all.pdf: compositional bar charts plot
- ranks_all.pdf: rank abundance curves plot
- diff_[taxon level].tsv : Tables of differentially abundant taxa from the primary tests of ANCOM-BC, if nay were identified
- glbldiff_[taxon level].tsv : Tables of differentially abundant taxa from the global tests of ANCOM-BC, if nay were identified (minimum 3x groups)
1. QC and processing visualizations
There are several steps from processing reads to getting to the "OTU table". The first steps are filtering and trimming low quality (portions of), de-replication, identification of ASVs, merging forward and reverse, and lastly identification of chimeric reads. At each step (excluding ASV inference), some reads are (very likely) to be discarded.
1A. Reads retained during processing
These plots shows the data loss during QC and processing.
In the first, the the x-axis is the number of reads, y-axis is the sample, and each colored bar within a sample is a stage of QC/processing, with the percent of raw reads ("in", lightest green) indicated near the top of each bar. Note, only a maximum of 5 samples per grouping are shown for simplicity.
In the second, the data are reconfigured into a boxplot with the x-axis being the QC step, the y-axis being number of reads, and a point for each sample overlaid with controls and real samples in separate panels by grouping ("NA" = controls).
1B. Contaminant identification and removal
These plots shows the types of contaminants identified and the amount of ASVs and reads per these several different categories including lacking taxonomic affiliation at the Phylum level, taxonomy consistent with plastids, extremely low abundances, or identification modeled by the package decontam. Ideally, a small amount of reads and ASVs would be removed.
In the first, the y-axis is the sample, the x-axis is the source of contamination for 5 samples per group dsiplayed. Each point is sized by the number of ASVs identified and removed, and then colored by the total number of reads assigned to those ASVs.
In the second and third plots, box plots colored by grouping are shown for all samples the types of contaminants that were removed and the ASVs (first) and the reads (second) assigned to them on the y-axis. The last box on the x-axis shows the # ASVs or reads that were not considered contaminants, representing the real data that will be analyzed.
1C. Taxonomy assignment confidence
When taxonomy is assigned, there is a confidence value (here, bootstraps) for each level from domain to species (when dada2 is run a certain way). We by default use the 'silva_nr99_v138.1_wSpecies_train_set' database, which will assign microbial taxonomy from Domain to species in one step, usign the SILVA database (informed by genomic information determined using the GTDB), which is version 138 sequences that were clustered at 99% nucleotid identity. If a different database is desired, we have most available listed here - https://benjjneb.github.io/dada2/training.html . Despite achieving a taxonomic assignment to a species (or sometimes even more resolved level such as subspecies) is possible, and can even have a high confidence value, most assignments of species are not considered fully reliable. This plot shows the distributions of these confidence values. Each vertical panel represents the taxonomic level (e.g., phylum, class, etc.) and the horizontal panel indicates whether or not the ASV was a contaminant. The x-axis within a panel is a phylum, and the y-axis is the bootstrap, or confidence, value (closer to 100 is more confident). Each colored point represents a single ASV at each taxonomic level, with the gray point and vertical line indicating the mean and standard error. Additionally, for each taxon level per phylum the number of ASVs with a bootstrap value at least 50, the cutoff for assignment, is indicated as text. Typically, higher levels of taxonomy can be more confidently assigned.
1D. Reads vs. recovery
These next three plots show the relationship between the number of reads and the number of ASVs recovered per sample. Ideally, there is NO strong relationship as the number of reads is technical, and the number of ASVs biological.
In the first plot, the x-axis is the number of reads and the y-axis is the richness, or number of ASVs, and each point represents a sample. The points are backed by a linear regression line with shaded confidence intervals and text in the upper left corner displaying the support values, caomputed only for the entire dataset. Ideally, the points do not fit the lines well and the shaded regions are large, the adjusted R^2 value is low and the p value high.
The second is a rarefaction curve, which shows the number of taxa recovered at sub-samples of reads, such that the x-axis is the sequencing effort and the y-axis the the number of ASVs recovered (the gray helps the eye follow the curves the sample labels on the right). Ideally, each sample's curve plateaus so that no sample is under-sampled. The vertical dashed black line indicates the minimum depth of contaminant-removed reads.
2. Core microbiome analyses
There are many analyses that can be done with 16S rRNA gene amplicon datasets, but some are far more popular than others. The most common include alpha-diversity (i.e., within sample; e.g., Shannon's Entropy or Chao1) and beta-diversity (i.e., between sample distance; e.g., Bray-Curtis dissimilarity represented as an NMDS) of the communities, bar charts showing the proportional composition of community members, and ranked-abundance curves of the most/least abundant community members.
2A. Alpha diversity
There are many ways to calculate alpha diversity. Many are implemented as part of a suite of tools, some include phylogenetic information, i.e., are additionally informed by phylogenetic trees, but all tend to show similar patterns that hint at how many community members (richness) there are and their abundance relative to each other (evenness/dominance). Here, we calculate 4 estimates of alpha-diversity per sample, including richness (simply, the number of ASVs or assigned taxa), Shannon's Entropy, Pielou's Evenness, and both regular and inverse Simpson indices, for all levels of taxonomy. Each one has its limitations, but should follow the same trends, so everyone tends to have their own favorite(s). This plot is separated into vertical panels for each of the five indices. The x-axis is the value for that index, and the y-axis is the sample, and points are colored by their group. For all these indices, the higher the value, the more "diverse". Note, that only a maximum of 10 samples are shown. The right plots show the diversity indices as boxplots for the full dataset, where each panel and x-axis is one index and the y-axis is the value and colored by their group. Wile provided, largely given apparent use in field, "diversity" at higher levels of taxonomy, e.g. Class and Phylum, are not that meaningful.
2B. Beta diversity
Beta diversity, like alpha, can also be examined in a variety of ways using different calculations to, generally, arrive at the same conclusion (thankfully!), which is how samples cluster. Similar to a PCA/PCoA, the closer (or less distant) samples are to each other, the more similar they are, and thus the lower beta diversity. Important note: ASVS with mean relative abundance <1*10^-5 were excluded from these analyses, though they were not identified as likely contaminants with low abundance may introduce excessive noise.
This first plot shows the beta diversity, or clustering, of samples. The NMDS was solved from Bray-Curtiss dissimilarity matrices, which are robust for 0-inflated compositional datasets (such as microbiome sequencing). Each point is a sample, colored by grouping, and both axes are a non-metric dimension (i.e., they don't actually mean anything outside this plot). The stress, or confidence in the overall clustering, is displayed as text in the lower left corner, generally below 0.2 is a decent fit and the closer to 0 the better. Samples that are near each other have a more similar composition, and the statistical support for similarity between the defined groups as estimated by ANOSIM and ADONIS2 (i.e., PERMANOVA) are in the lower left corner.
The second series of plots shows certain taxa at various taxonomic levels with significant (R > 0.1, p < 0.05) fits to the ordination.The points match the NMDS, and arrows are shown for each distinct taxon, colored by phylum, and the distance from the center scaled by its R value such that longer arrows had greater impacts on the clustering.
2C. Compositional bar charts
One of the most popular visualizations of microbiome data is the stacked bar charts, in which the relative abundance of each community member is colored and stacked, summed to 1 or 100%, and displayed side by side so general trends in changes, or lack thereof, in composition can be inferred. A weakness of these visualizations is that it's tough to get the details out of them; if the details as to which specific taxa are likely important, another analyses are better suited (e.g., rank abundance curve, heatmap, etc.). Thus, the main difference between each plot will simply be the number of breaks in the columns. Important note: ASVS with mean relative abundance <1*10^-5 were excluded from these analyses, though they were not identified as likely contaminants with low abundance may introduce excessive noise.
These series of plots show the composition at various levels of taxon resolution (i.e., phylum, class, etc.). For each, there are two panels showing essenitallyl the same information horizontally. On the left, the x-axis is the relative abundance and the y-axis is up to 5 samples, with vertical separation of the groups. Each microbial group makes up an individual rectangle, colored by its phylum. Lastly, the samples are seperated into their groups. More resolved taxa are not colored here because there may be many, for which no color palette would be suitable and would hinder interpretation. Generally, while there may be many members of a given group they are rarely all at similar abundances, i.e., one or a handful tend to dominate. For the more finely resolved taxa levels (species, genus, family), white bars are overlaid on members that are "unassigned" at that taxonomic level. On the right is this same information, but showing the average of each taxonomic unit for each group, and flipped so that the bars are vertical.
Here, a re-sortable table to match the figure for the phyla is available for simple perusal.
2D. Ranked-abundance curves
Ranked-abundance curves show the head (first/top/most abundant) and/or tail (last/lowest/least abundant) members of the communities. Often, but not always, abundant members are abundant for a reason, such as their carbon/electron/nitrogen/etc. source is abundant and have little competition. On the other hand, low abundance members can fulfill critical niches, such as the production of vitamins or amino acids that other community members lack the capacity to. These functional traits are not accessible directly with a 16S rRNA gene amplicon sequencing experiment, but certain patterns may be inferred with sufficient additional data or phylogenetic proximity to a characterized neighbor. Regardless, for the purposes here, the most abundant members are shown assuming that their high relative abundances were observed for some reason, even if it is uncertain. These series of plots show the top 10 members in each grouping at various levels of taxon resolution (i.e., phylum, class, etc.). For each, the x-axis the are the members, sorted in order of decreasing mean relative abundance for the full dataset, and the y-axis is the relative abundance. The bars show the mean, with standard errors as errorbars, and individual samples are shown as points, separated and colored by the group, with the inclusion of the full dataset as grey. Corresponding tables appear towards end of file.
2F. Differential abundances
For the most part, the goal of 16S rRNA gene amplicon sequencing study is to associate taxa to a change in phenotype, often by testing for differences in relative abundances, and here we used ANCOM-BC. At Phylum to species levels tests, the ASVs that were unassigned were excluded and the rest were aggregated. Only taxa that were significantly different are shown, and may be filtered to shown only the taxa with the greatest log-fold changes. Note, global test results are only available when testing for diferences among 3 or more groups.
3. Tables
3A. top 10 (matching ranked-abundance plots)
Here, re-sortable tables are available to help describe the data for the top 10 members at different taxonomic levels.
top 10 Phyla
top 10 Classes
top 10 Orders
top 10 Families
top 10 Genera
top 10 species
top 10 ASVs
3B. differentially abundant taxa
Here, re-sortable tables are available to help describe the data from the differential abundance testing. Note, not all taxonomic anks #will have differentially abundant taxa (table with 0 entries), and all differentially abundant taxa should be checked.
diff. abun. Phyla
diff. abun. Classes
diff. abun. Orders
diff. abun. Families
diff. abun. Genera
diff. abun. species
diff. abun. ASVs