Designing machine learning workflows with an application to topological data analysis

Authors: Eric Cawi aff001; Patricio S. La Rosa aff002; Arye Nehorai aff001
Affiliations: Preston M. Green Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, United States of America aff001; Global IT Analytics, Crop Science Division, Bayer Company, Saint Louis, MO, United States of America aff002
Published in: PLoS ONE. 2019;14(12)
Category: Research Article
doi: 10.1371/journal.pone.0225577


In this paper we define the concept of the Machine Learning Morphism (MLM), a fundamental building block for expressing operations performed in machine learning, such as data preprocessing, feature extraction, and model training. Inspired by statistical learning, MLMs are morphisms whose parameters are minimized via a risk function. We explore operations such as composition of MLMs and the conditions under which sets of MLMs form a vector space. These operations are used to build a machine learning workflow from data preprocessing to final task completion. We examine the Mapper algorithm from topological data analysis as an MLM, and build several workflows for binary classification incorporating Mapper on the Hospital Readmissions and Credit Evaluation datasets. The advantage of this framework lies in the ability to easily build, organize, and compare multiple workflows, and in the joint optimization of parameters across multiple steps of an application.
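The abstract's core idea, a parameterized map whose parameters are chosen by minimizing a risk function, and the composition of such maps into a workflow, can be sketched in a few lines. The following is an illustrative reading only, not the paper's implementation; the names `MLM` and `compose` and the toy threshold classifier are hypothetical.

```python
# A minimal sketch of the Machine Learning Morphism (MLM) concept:
# a map f(x; theta) whose theta is selected by risk minimization,
# plus composition of fitted MLMs into a workflow.

class MLM:
    """A morphism f(x; theta) with theta chosen by risk minimization."""

    def __init__(self, f, candidate_thetas, risk):
        self.f = f                                # map: (X, theta) -> output
        self.candidate_thetas = candidate_thetas  # search space for theta
        self.risk = risk                          # risk: (output, y) -> float
        self.theta = None

    def fit(self, X, y):
        # Empirical risk minimization, here by exhaustive search.
        self.theta = min(self.candidate_thetas,
                         key=lambda t: self.risk(self.f(X, t), y))
        return self

    def __call__(self, X):
        return self.f(X, self.theta)


def compose(g, f):
    """Composition g . f: the output of morphism f feeds morphism g."""
    return lambda X: g(f(X))


# Toy example: a 1-D threshold classifier fit under 0-1 risk.
X = [0.1, 0.2, 0.8, 0.9]
y = [0, 0, 1, 1]
thresh = MLM(f=lambda X, t: [int(x > t) for x in X],
             candidate_thetas=[i / 10 for i in range(11)],
             risk=lambda pred, y: sum(p != v for p, v in zip(pred, y)) / len(y))
thresh.fit(X, y)

# Composing with a label-flipping step illustrates workflow building.
flipped = compose(lambda preds: [1 - p for p in preds], thresh)
```

Under this reading, a full workflow (preprocessing, then Mapper, then a classifier) would be a chain of such compositions, with the parameters of every step exposed to a single joint risk minimization.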

Keywords:

Algorithms – Machine learning – Machine learning algorithms – Optimization – Principal component analysis – Support vector machines – Vector spaces



Article published in: 2019, Issue 12