The Univariate Flagging Algorithm (UFA): An interpretable approach for predictive modeling

Authors: Mallory Sheth aff001;  Albert Gerovitch aff001;  Roy Welsch aff001;  Natasha Markuzon aff002
Authors' affiliations: Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America aff001;  The Charles Stark Draper Laboratory, Cambridge, Massachusetts, United States of America aff002
Published in: PLoS ONE. 2019;14(10)
Category: Research Article
doi: 10.1371/journal.pone.0223161


In many data classification problems, a number of methods will give similar accuracy. However, when working with practitioners who are not experts in data science, such as doctors, lawyers, and judges, finding interpretable algorithms can be a critical success factor. Practitioners have a deep understanding of the individual input variables but far less insight into how those variables interact with each other. For example, there may be ranges of an input variable for which the observed outcome is significantly more or less likely. This paper describes an algorithm for automatic detection of such thresholds, called the Univariate Flagging Algorithm (UFA). The algorithm searches for a separation that optimizes the difference between the separated regions while maintaining a high level of support. We evaluate its performance using six sample datasets and demonstrate that thresholds identified by the algorithm align well with published results and known physiological boundaries. We also introduce two classification approaches that use UFA and show that the performance attained on unseen test data is comparable to or better than that of traditional classifiers when confidence intervals are considered. We identify conditions under which UFA performs well, including applications with large amounts of missing or noisy data, applications with a large number of inputs relative to observations, and applications where incidence of the target is low. We argue that ease of explanation of the results, robustness to missing data and noise, and detection of low-incidence adverse outcomes are desirable features for clinical applications that can be achieved with a relatively simple classifier such as UFA.
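The threshold search described above can be illustrated with a minimal sketch. This is not the authors' published implementation; it is an assumption-laden toy that scans candidate cut points on a single variable and keeps the split maximizing the difference in outcome rate between the two sides, subject to a minimum support on each side, while simply skipping missing values:

```python
def flag_threshold(values, outcomes, min_support=0.1):
    """Illustrative univariate threshold search (not the authors' exact method).

    Returns (threshold, rate_difference), or None if no split satisfies the
    minimum-support constraint. Missing values (None) are ignored, loosely
    mirroring UFA's reported robustness to missing data.
    """
    pairs = sorted((v, o) for v, o in zip(values, outcomes) if v is not None)
    n = len(pairs)
    best = None
    for i in range(1, n):
        left, right = pairs[:i], pairs[i:]
        # Enforce minimum support: each side must hold enough observations.
        if len(left) < min_support * n or len(right) < min_support * n:
            continue
        rate_left = sum(o for _, o in left) / len(left)
        rate_right = sum(o for _, o in right) / len(right)
        diff = abs(rate_left - rate_right)
        if best is None or diff > best[1]:
            # Place the cut midway between the two adjacent observed values.
            best = ((left[-1][0] + right[0][0]) / 2, diff)
    return best

# Hypothetical example: an adverse outcome becomes far more likely above a
# body-temperature boundary, which the search should recover.
temps = [36.5, 36.8, 37.0, 37.1, 37.4, 38.2, 38.6, 39.0, 39.5, 40.1]
sick = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
thr, diff = flag_threshold(temps, sick)  # thr lands between 37.4 and 38.2
```

The example recovers a cut near the afebrile/febrile boundary, echoing the paper's finding that flagged thresholds align with known physiological limits.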

Keywords:

Algorithms – Body temperature – Death rates – Machine learning – Machine learning algorithms – Medical doctors – Sepsis – Support vector machines


