Predicting the occurrence of surgical site infections using text mining and machine learning

Authors: Daniel A. da Silva (1); Carla S. ten Caten (1); Rodrigo P. dos Santos (2); Flavio S. Fogliatto (1); Juliana Hsuan (3)
Author affiliations: (1) Industrial Engineering Department, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil; (2) Hospital de Clinicas de Porto Alegre, Porto Alegre, Brazil; (3) Copenhagen Business School, Copenhagen, Denmark
Published in: PLoS ONE 14(12)
Category: Research Article
doi: 10.1371/journal.pone.0226272


In this study, we propose using text mining and machine learning methods to predict and detect surgical site infections (SSIs) from textual descriptions of surgeries and patients' post-operative records, mined from the database of a high-complexity university hospital. SSIs are among the most common adverse events experienced by hospitalized patients, and preventing them is fundamental to ensuring patient safety. Knowledge of SSI occurrence rates may also help prevent future episodes. We analyzed 15,479 surgery descriptions and post-operative records, testing different preprocessing strategies and the following machine learning algorithms: Linear SVC, Logistic Regression, Multinomial Naive Bayes, Nearest Centroid, Random Forest, Stochastic Gradient Descent, and Support Vector Classification (SVC). For prediction, the best result was obtained with Stochastic Gradient Descent (79.7% ROC-AUC); for detection, Logistic Regression performed best (80.6% ROC-AUC).
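
As a rough illustration of the kind of workflow the abstract describes (turning free-text records into features, fitting a classifier, and scoring it by ROC-AUC), here is a minimal pure-Python sketch. It is not the authors' code: the toy notes, the whitespace tokenizer, and the helper names (`tokenize`, `vectorize`, `centroid`, `ssi_score`, `roc_auc`) are assumptions made for illustration, and the from-scratch nearest-centroid classifier only stands in for one of the algorithm families listed above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (real preprocessing would do more)."""
    return text.lower().split()


def vectorize(text, vocabulary):
    """Term-frequency vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy notes (hypothetical, NOT data from the study); 1 = SSI, 0 = no SSI.
train_notes = [
    ("wound with purulent drainage and erythema", 1),
    ("fever and purulent secretion at incision", 1),
    ("incision clean and dry no complaints", 0),
    ("patient stable wound clean discharged", 0),
]
test_notes = [
    ("erythema and purulent drainage at wound", 1),
    ("wound clean and dry patient stable", 0),
    ("fever with secretion at incision", 1),
    ("no complaints patient discharged", 0),
]

# Build the vocabulary from training text only, then one centroid per class.
vocabulary = sorted({tok for text, _ in train_notes for tok in tokenize(text)})
pos_centroid = centroid([vectorize(t, vocabulary) for t, y in train_notes if y == 1])
neg_centroid = centroid([vectorize(t, vocabulary) for t, y in train_notes if y == 0])


def ssi_score(text):
    """Higher when the note is closer to the SSI centroid than to the non-SSI one."""
    vector = vectorize(text, vocabulary)
    return distance(vector, neg_centroid) - distance(vector, pos_centroid)


test_labels = [y for _, y in test_notes]
test_scores = [ssi_score(t) for t, _ in test_notes]
auc = roc_auc(test_labels, test_scores)
print(f"ROC-AUC on the toy test set: {auc:.3f}")
```

Swapping in a different classifier only changes how `ssi_score` is computed; the ROC-AUC step stays the same, which is what makes the metric convenient for comparing several algorithms on one dataset.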

Keywords:

Adverse events – Algorithms – Data mining – Machine learning – Machine learning algorithms – Preprocessing – Surgical and invasive medical procedures – Text mining


Article published in the journal PLoS ONE, 2019, issue 12