A study on separation of the protein structural types in amino acid sequence feature spaces

Authors: Xiaogeng Wan (1); Xinying Tan (2)
Affiliations: (1) College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, China; (2) The Fourth Center of PLA General Hospital, Beijing, China
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0226768


Proteins are diverse in their sequences, structures and functions, so it is important to study the relations among the three. In this paper, we survey the relations between protein sequences and their structures. We use the natural vector (NV) and the averaged property factor (APF) features to represent protein sequences as feature vectors, and we use the multi-class MSE and convex hull methods to separate proteins of different structural classes into different regions. We found that proteins from different structural classes are separable by hyperplanes and convex hulls in the natural vector feature space: the feature vectors of different structural classes fall into disjoint regions or convex hulls in the high-dimensional feature space. The natural vector outperforms the averaged property factor method in identifying the structures, and the convex hull method outperforms the multi-class MSE in separating the feature points. These results support a strong connection between protein sequences and their structures, and they suggest that the amino acid composition and sequence arrangement captured by the natural vector have a greater influence on structure than the averaged physical property factors of the amino acids.
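The abstract does not define the natural vector, so the following is only an illustrative sketch. It assumes the commonly used 60-dimensional formulation, in which each of the 20 standard amino acids contributes its count, its mean position, and a normalized second central moment of its positions. The function name `natural_vector`, the amino acid ordering, and the handling of absent residues are assumptions made for this example, not details taken from the paper.

```python
from typing import List

# Assumed ordering of the 20 standard amino acids (alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(seq: str) -> List[float]:
    """Return a 60-dimensional vector: 20 counts, 20 mean positions, 20 second moments.

    Positions are 1-indexed. An amino acid that does not occur in the
    sequence contributes zeros (an assumption of this sketch).
    """
    seq = seq.upper()
    n = len(seq)
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, residue in enumerate(seq) if residue == aa]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment of the positions, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    vec = natural_vector("ACA")
    print(len(vec), vec[0], vec[20], vec[40])
```

Under these assumptions, every sequence maps to one point in a 60-dimensional space, which is the kind of feature space in which hyperplane or convex hull separation of structural classes can then be examined.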

Keywords:

Machine learning – Protein sequencing – Protein structure – Protein structure databases – Sequence alignment – Sequence databases – Structural proteins – Vector spaces



This article was published in PLoS ONE, 2019, Issue 12