An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences

Authors: Siquan Hu aff001; Ruixiong Ma aff001; Haiou Wang aff003
Author affiliations: School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China aff001; Sichuan Jiuzhou Video Technology Co., Ltd, Mianyang, China aff002; School of Chemistry and Biological Engineering, University of Science and Technology Beijing, Beijing, China aff003
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0225317


As the number of known proteins grows, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN. However, these methods do not consider the context in amino acid sequences, which makes it difficult for them to capture sequence features adequately. In this study, a new method called CNN-BiLSTM, which combines a bidirectional long short-term memory (BiLSTM) recurrent neural network with a convolutional neural network (CNN), is proposed to identify DNA-binding proteins. The CNN-BiLSTM model can explore the potential contextual relationships within amino acid sequences and extract more features than traditional models can. The experimental results show that CNN-BiLSTM achieves a validation set prediction accuracy of 96.5%, which is 7.8% higher than that of SVM, 9.6% higher than that of DNABP and 3.7% higher than that of CNN-RNN. On 20,000 independent samples provided by UniProt that were not involved in model training, the accuracy of CNN-BiLSTM reached 94.5%, which is 12% higher than that of SVM, 4.9% higher than that of DNABP and 4% higher than that of CNN-RNN. We visualized and compared the training processes of CNN-BiLSTM and CNN-RNN and found that the former generalizes better from the training dataset, which suggests that CNN-BiLSTM adapts to a wider range of protein sequences. On the test set, the predictions of CNN-BiLSTM are more credible because its predicted scores are closer to the sample labels than those of CNN-RNN. Therefore, the proposed CNN-BiLSTM is a more powerful method for identifying DNA-binding proteins.
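To make the notion of sequence context concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes a simple integer encoding of the 20 standard amino acids (index 0 reserved for padding and unknown residues) and replaces the LSTM cells with a toy decaying sum. The names `encode`, `bidirectional_context` and `decay` are hypothetical and chosen only for illustration. The point it shows is that a bidirectional layer gives every residue one summary of what precedes it and one of what follows it.

```python
from typing import Dict, List, Tuple

# The 20 standard amino acids. Index 0 is reserved for padding and
# unknown residues (an assumption of this sketch, not taken from the paper).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, max_len: int) -> List[int]:
    """Map residues to integer indices, then truncate or right-pad to max_len."""
    indices = [AA_TO_INDEX.get(aa, 0) for aa in sequence.upper()[:max_len]]
    return indices + [0] * (max_len - len(indices))


def bidirectional_context(
    indices: List[int], decay: float = 0.5
) -> List[Tuple[float, float]]:
    """Toy stand-in for a bidirectional recurrent layer.

    The forward state at position t summarizes residues 0..t, and the
    backward state summarizes residues t..end. A real BiLSTM learns gated
    updates; here the update is a fixed decaying sum (hypothetical).
    """
    forward: List[float] = []
    state = 0.0
    for x in indices:  # left-to-right pass
        state = decay * state + x
        forward.append(state)

    backward: List[float] = []
    state = 0.0
    for x in reversed(indices):  # right-to-left pass
        state = decay * state + x
        backward.append(state)
    backward.reverse()  # realign with the original positions

    return list(zip(forward, backward))


if __name__ == "__main__":
    encoded = encode("ACD", max_len=5)
    for position, (fwd, bwd) in enumerate(bidirectional_context(encoded)):
        print(position, fwd, bwd)
```

In a full model of the kind the abstract describes, the encoded indices would typically feed an embedding and convolutional stage before the recurrent stage, and the paired forward and backward states would be passed to a classifier. Those stages are omitted here.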

Keywords:

Deep learning – DNA-binding proteins – Machine learning – Machine learning algorithms – Protein sequencing – Protein structure prediction – Recurrent neural networks – Support vector machines


Article published in the journal: 2019, Issue 11