TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology

Authors: Yong Fang [1]; Shengjun Han [1]; Cheng Huang [1]; Runpu Wu [2]
Author affiliations: [1] College of Cybersecurity, Sichuan University, Chengdu 610065, China; [2] China Information Technology Security Evaluation Center, Beijing 100085, China
Published in: PLoS ONE 14(11)
Category: Research Article
doi: 10.1371/journal.pone.0225196


With the widespread use of Web applications, security issues in source code are increasing, and exposed vulnerabilities seriously endanger the interests of service providers and customers. Several models address this problem, but most of them rely on complex graphs generated from source code or on regex patterns based on expert experience. This paper proposes TAP, an analysis model based on the token mechanism and deep learning, for discovering vulnerabilities in PHP (PHP: Hypertext Preprocessor) Web programs conveniently. Building on the token mechanism of the PHP language, we designed a custom tokenizer that unifies tokens, supports some features of PHP, and optimizes parsing. The tokenizer also implements parameter iteration to achieve data flow analysis. We trained the deep learning model of TAP on the Software Assurance Reference Dataset (SARD) and the SQLI-LABS dataset by combining the word2vec model with a Long Short-Term Memory (LSTM) network. In the experiment on the CWE-89 dataset, TAP achieves an Area Under the Curve (AUC) of 0.9941, which is better than the other models, and the highest accuracy, 0.9787. Further, compared with RIPS, TAP performs much better in multiclass classification, with a Kappa of 0.8319 and a Hamming distance of 0.0840.
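The two multiclass figures quoted above, Kappa and Hamming distance, can be illustrated with a minimal Python sketch. It assumes that Kappa means Cohen's kappa and that the Hamming distance is the fraction of samples whose predicted label differs from the true one; the abstract does not spell out either definition. The function names and the labels `safe`, `sqli`, and `xss` are hypothetical toy data, not results or classes taken from the paper.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(y_true)
    # Fraction of samples where prediction and ground truth agree.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (a single class everywhere): kappa is undefined;
        # this sketch returns 1.0 as an illustrative convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def hamming_distance(y_true, y_pred):
    """Normalised Hamming distance: fraction of mismatched labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels for eight code samples (not data from the paper).
y_true = ["safe", "safe", "sqli", "sqli", "xss", "xss", "safe", "sqli"]
y_pred = ["safe", "safe", "sqli", "xss", "xss", "xss", "safe", "safe"]

kappa = cohen_kappa(y_true, y_pred)
distance = hamming_distance(y_true, y_pred)
print(f"kappa={kappa:.4f} hamming={distance:.4f}")
```

Under these assumptions a higher Kappa and a lower Hamming distance both indicate better multiclass agreement with the ground truth, which matches the direction of the comparison with RIPS.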

Keywords:

Deep learning – Graphs – Internet – Machine learning – Memory recall – National security – Source code – Web-based applications



This article was published in the journal PLoS ONE, 2019, Issue 11