Exploring thematic structure and predicted functionality of 16S rRNA amplicon data

Authors: Stephen Woloszynek aff001; Joshua Chang Mell aff002; Zhengqiao Zhao aff001; Gideon Simpson aff003; Michael P. O’Connor aff004; Gail L. Rosen aff001
Author affiliations: Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America aff001; Department of Microbiology and Immunology, Drexel University College of Medicine, Philadelphia, Pennsylvania, United States of America aff002; Department of Mathematics, Drexel University, Philadelphia, Pennsylvania, United States of America aff003; Department of Biodiversity, Earth, and Environmental Science, Drexel University, Philadelphia, Pennsylvania, United States of America aff004
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0219235


Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such relations is often difficult because microbiome data are compositional, sparse, and high-dimensional. Moreover, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes, rather than detecting specific indicator species, is more likely to facilitate biologically meaningful interpretations. Additionally, analyses that use taxonomic relative abundances to predict the abundances of different gene functions aggregate the predicted functional profiles across taxa, which precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach to explore co-occurring taxa using “topics” generated via a topic model and to link these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content from overall taxonomic relative abundances, we infer functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset, an oral cancer dataset, and a time-series dataset. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa, termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore how sets of co-occurring taxa behave and evolve over time.
These methods have been implemented in a freely available R package: https://cran.r-project.org/package=themetagenomics, https://github.com/EESI/themetagenomics.
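As a rough illustration of the distinction drawn above between aggregated and within-topic functional prediction, the toy Python sketch below weights per-taxon gene content by each topic's distribution over taxa, then mixes the topic-level profiles by one sample's topic proportions. It is not the themetagenomics implementation: the weighted sum is an assumed simplification, and every name and number (taxon_A, gene_1, the probabilities and copy numbers) is hypothetical.

```python
# Toy sketch: every name and number below is hypothetical, not taken from the paper.


def within_topic_function(topic_taxa, taxon_genes):
    """Weight each taxon's gene content by its probability within a topic."""
    profiles = {}
    for topic, taxa_probs in topic_taxa.items():
        profile = {}
        for taxon, prob in taxa_probs.items():
            for gene, copies in taxon_genes[taxon].items():
                profile[gene] = profile.get(gene, 0.0) + prob * copies
        profiles[topic] = profile
    return profiles


def aggregate_function(sample_topics, topic_profiles):
    """Mix topic-level profiles by a sample's topic proportions (aggregated view)."""
    total = {}
    for topic, weight in sample_topics.items():
        for gene, value in topic_profiles[topic].items():
            total[gene] = total.get(gene, 0.0) + weight * value
    return total


# Topic-over-taxa distributions (each row sums to 1): which taxa co-occur in a topic.
topic_taxa = {
    "topic_1": {"taxon_A": 0.7, "taxon_B": 0.3, "taxon_C": 0.0},
    "topic_2": {"taxon_A": 0.1, "taxon_B": 0.1, "taxon_C": 0.8},
}

# Assumed per-taxon gene copy numbers, standing in for a functional reference table.
taxon_genes = {
    "taxon_A": {"gene_1": 2, "gene_2": 0},
    "taxon_B": {"gene_1": 1, "gene_2": 1},
    "taxon_C": {"gene_1": 0, "gene_2": 4},
}

# Topic proportions for one hypothetical sample.
sample_topics = {"topic_1": 0.9, "topic_2": 0.1}

topic_profiles = within_topic_function(topic_taxa, taxon_genes)
sample_profile = aggregate_function(sample_topics, topic_profiles)

for topic, profile in topic_profiles.items():
    print(topic, {gene: round(value, 2) for gene, value in profile.items()})
print("sample", {gene: round(value, 2) for gene, value in sample_profile.items()})
```

In this toy setting gene_2 is concentrated in topic_2, a pattern that is visible in the per-topic profiles but blurred once everything is collapsed into a single sample-level profile.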

Keywords:

Crohn's disease – Gene prediction – Metagenomics – Microbial taxonomy – Microbiome – Ribosomal RNA – Shotgun sequencing – Taxonomy



Article published in the journal PLoS ONE, 2019, Issue 12