An evaluation of different classification algorithms for protein sequence-based reverse vaccinology prediction

Authors: Ashley I. Heinson [1]; Rob M. Ewing [2]; John W. Holloway [3]; Christopher H. Woelk [4]; Mahesan Niranjan [5]
Author affiliations: [1] Faculty of Medicine, University of Southampton, Southampton, United Kingdom; [2] Department of Biological Sciences, University of Southampton, Southampton, United Kingdom; [3] Faculty of Medicine, University of Southampton, Southampton, United Kingdom; [4] Merck Exploratory Science Center, Cambridge, United States of America; [5] Department of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0226256


Previous work has shown that proteins with the potential to be vaccine candidates can be predicted from features derived from their amino acid sequences. In this work, we empirically compare a range of machine learning classifiers on this sequence-based inference problem. The dataset comprises 200 known vaccine candidates and 200 negative examples, described by 525 features derived from the amino acid sequences, with feature selection performed by greedy backward elimination. Using systematic cross-validation, we show that simple classification algorithms often perform as well as more complex kernel support vector machines. The work also introduces a novel form of cross-validation applied across bacterial species, in which all validation proteins come from a single species of bacterium that is not represented in the training set. We term this Leave One Bacteria Out Validation (LOBOV).
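
The LOBOV scheme described above can be sketched with scikit-learn's grouped cross-validation. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the species labels, the number of species and features, and the two classifiers (logistic regression as the "simple" model, an RBF-kernel SVM as the "complex" one) are all chosen here for illustration, and feature selection is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's dataset): 200 negatives and
# 200 positives, 20 made-up features, 8 hypothetical bacterial species.
n_proteins, n_features, n_species = 400, 20, 8
X = rng.normal(size=(n_proteins, n_features))
y = np.repeat([0, 1], n_proteins // 2)
X[y == 1, :3] += 1.0  # give the positive class a weak signal in 3 features

# Hypothetical species label per protein; every species holds both classes.
species = np.arange(n_proteins) % n_species


def lobov_scores(model, X, y, groups):
    """Accuracy per held-out species: train on the others, test on one."""
    return cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())


# Illustrative classifier choices (assumptions, not the paper's exact models).
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "RBF-kernel SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

results = {name: lobov_scores(m, X, y, species) for name, m in models.items()}

for name, scores in results.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} "
          f"over {len(scores)} held-out species")
```

Each fold holds out every protein from one species, so the reported accuracy reflects generalisation to a bacterium the classifier has never seen, which is the property that distinguishes LOBOV from ordinary k-fold cross-validation.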

Keywords:

Algorithms – Antibiotic resistance – Bacterial pathogens – Machine learning – Machine learning algorithms – Sequence motif analysis – Support vector machines – Vaccines



Article published in: PLoS ONE, 2019, Issue 12