Enhancing timeliness of drug overdose mortality surveillance: A machine learning approach

Authors: Patrick J. Ward aff001; Peter J. Rock aff001; Svetla Slavova aff001; April M. Young aff002; Terry L. Bunn aff001; Ramakanth Kavuluru aff006
Author affiliations: Kentucky Injury Prevention and Research Center, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America aff001; Department of Epidemiology, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America aff002; Department of Biostatistics, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America aff003; Center on Drug and Alcohol Research, College of Medicine, University of Kentucky, Lexington, Kentucky, United States of America aff004; Department of Preventive Medicine and Environmental Health, College of Public Health, University of Kentucky, Lexington, Kentucky, United States of America aff005; Department of Computer Science, College of Engineering, University of Kentucky, Lexington, Kentucky, United States of America aff006; Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, Kentucky, United States of America aff007
Published in: PLoS ONE 14(10)
Category: Research Article
doi: 10.1371/journal.pone.0223318



Timely data are key to effective public health responses to epidemics. Drug overdose deaths are identified in surveillance systems through ICD-10 codes present on death certificates. ICD-10 coding takes time, but free-text information is available on death certificates before ICD-10 coding is complete. The objective of this study was to develop a machine learning method that classifies free-text death certificates as drug overdoses in order to provide faster drug overdose mortality surveillance.


Using 2017–2018 Kentucky death certificate data, free-text fields were tokenized, and features were created from these tokens using natural language processing (NLP). Word, bigram, and trigram features were created, as well as features indicating the part of speech of each word. These features were then used to train machine learning classifiers on 2017 data. The resulting models were tested on 2018 Kentucky data and compared to a simple rule-based classification approach. Documented code for this method is available for reuse and extension: https://github.com/pjward5656/dcnlp.
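As an illustration of the feature creation step described above, the following is a minimal sketch, not the authors' implementation (which is in the linked repository). It builds word, bigram, and trigram counts from a single free-text field and includes a toy keyword rule as a stand-in for a rule-based baseline. The function names, the example text, and the keyword list `OVERDOSE_TERMS` are assumptions made for illustration; part-of-speech features are omitted because they require an external tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(text, max_n=3):
    """Count word (n=1), bigram (n=2), and trigram (n=3) features."""
    tokens = tokenize(text)
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


# Hypothetical keyword list for a toy rule-based baseline (illustrative only;
# the study's actual rules are not given in this abstract).
OVERDOSE_TERMS = {"overdose", "intoxication", "toxicity"}


def rule_based_flag(text):
    """Flag a record if any token matches one of the hypothetical keywords."""
    return any(token in OVERDOSE_TERMS for token in tokenize(text))


# Hypothetical cause-of-death text, not taken from the study data.
example = "Acute fentanyl intoxication"
feats = ngram_features(example)
flagged = rule_based_flag(example)
```

In a full pipeline, the per-record counts would be assembled into a feature matrix, a classifier would be fitted on the 2017 records, and its predictions on the 2018 records would be compared against the rule-based flags.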


The top-scoring machine learning model achieved 0.96 positive predictive value (PPV) and 0.98 sensitivity, for an F-score of 0.97, in identifying fatal drug overdoses on test data. This model achieved significantly higher sensitivity (p<0.001) than the rule-based approach. Additional feature engineering may improve the model’s predictions. The model can be deployed on death certificates as soon as the free text is available, eliminating the time needed to code the death certificates.
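The reported F-score is the harmonic mean of PPV and sensitivity, which can be checked with a few lines of arithmetic. The sketch below uses the two values reported above; the confusion-matrix counts at the end are hypothetical and serve only to show how PPV and sensitivity are derived from counts.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: share of flagged records that are true overdoses."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Sensitivity: share of true overdoses that the classifier flags."""
    return true_pos / (true_pos + false_neg)


def f_score(precision, recall):
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract.
reported_f = f_score(0.96, 0.98)
print(round(reported_f, 2))  # 0.97

# Hypothetical counts, for illustration only (not from the study).
example_ppv = ppv(true_pos=98, false_pos=4)
example_sensitivity = sensitivity(true_pos=98, false_neg=2)
```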


Machine learning using natural language processing is a relatively new approach in the surveillance of health conditions. This method is an accessible application of machine learning that improves the timeliness of drug overdose mortality surveillance. As such, it can be used to inform public health responses to the drug overdose epidemic in near real time, rather than several weeks after the events.

Keywords:

Deep learning – Disease surveillance – Drug research and development – Machine learning – Natural language processing – Opioids – Public and occupational health – Support vector machines



Published in: 2019, issue 10