Biomarker discovery in inflammatory bowel diseases using network-based feature selection

Authors: Mostafa Abbas (1); John Matta (2); Thanh Le (3); Halima Bensmail (1); Tayo Obafemi-Ajayi (3); Vasant Honavar (4); Yasser EL-Manzalawy (4)
Author affiliations: (1) Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar; (2) Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL, United States of America; (3) Engineering Program, Missouri State University, Springfield, MO, United States of America; (4) College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, United States of America; (5) Geisinger Health System, Danville, PA, United States of America
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0225382


Reliable identification of inflammatory bowel disease (IBD) biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of IBD. We present an integrative approach, Network-Based Biomarker Discovery (NBBD), which combines network analysis methods for prioritizing potential biomarkers with machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of new-onset pediatric IBD metagenomics biopsy samples, we compare the performance of Random Forest (RF) classifiers trained on features selected by a representative set of traditional feature selection methods against RF classifiers trained on features selected by the NBBD framework. NBBD is configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers. We also evaluate a hybrid approach that combines the best traditional and NBBD-based feature selection methods, and we examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some of the state-of-the-art feature selection methods, including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective in reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.
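The general idea of network-based feature prioritization can be illustrated with a minimal sketch. It is not the authors' NBBD implementation: the abstract does not name the five network inference tools or the nine prioritization methods, so the sketch assumes a thresholded Pearson correlation network and degree centrality as stand-ins. The function names, the `threshold` parameter, and the toy abundance table are all hypothetical, and the classifier training step is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features_by_degree(samples, feature_names, threshold=0.6):
    """Rank features by degree centrality in a thresholded correlation network.

    samples: list of rows, one row of feature values per sample.
    An edge joins two features when |Pearson r| >= threshold (an assumed rule).
    Returns (feature, centrality) pairs, most central first.
    """
    columns = list(zip(*samples))  # one tuple of values per feature
    degree = {name: 0 for name in feature_names}
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                degree[feature_names[i]] += 1
                degree[feature_names[j]] += 1
    # Normalize by the maximum possible degree, guarding the single-feature case.
    max_degree = max(len(feature_names) - 1, 1)
    centrality = {name: d / max_degree for name, d in degree.items()}
    # Sort by descending centrality, breaking ties alphabetically.
    return sorted(centrality.items(), key=lambda item: (-item[1], item[0]))


def select_top_k(ranking, k):
    """Keep the names of the k highest-ranked features."""
    return [name for name, _ in ranking[:k]]


# Hypothetical toy abundance table: 4 samples (rows) by 4 taxa (columns).
toy_taxa = ["taxon_a", "taxon_b", "taxon_c", "taxon_d"]
toy_samples = [
    [6, 3, 3, 3],
    [4, 1, 3, 1],
    [4, 3, 1, 1],
    [2, 1, 1, 3],
]

toy_ranking = rank_features_by_degree(toy_samples, toy_taxa, threshold=0.6)
print(select_top_k(toy_ranking, 2))
```

In the toy table, taxon_a correlates with both taxon_b and taxon_c, so it becomes the hub of the network and is ranked first. The selected features would then be passed to a classifier such as a Random Forest to assess their discriminative power.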

Keywords:

Biomarkers – Biopsy – Centrality – Inflammatory bowel disease – Metagenomics – Microbial ecology – Network analysis – Network resilience



This article appeared in PLoS ONE, 2019, issue 11.