Machine learning algorithm validation with a limited sample size

Authors: Andrius Vabalas aff001;  Emma Gowen aff002;  Ellen Poliakoff aff002;  Alexander J. Casson aff001
Author affiliations: Materials, Devices and Systems Division, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, England, United Kingdom aff001;  School of Biological Sciences, The University of Manchester, Manchester, England, United Kingdom aff002
Published in: PLoS ONE 14(11), 2019
Category: Research Article
doi: 10.1371/journal.pone.0224365


Advances in neuroimaging, genomics, motion tracking, eye tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to classify autistic versus non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, when performed on pooled training and testing data, contributes considerably more to bias than parameter tuning does. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and the validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies in light of the validation method used.
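The feature-selection leakage described above is easy to reproduce. The sketch below (an illustration, not the paper's exact simulation setup: the sample size of 40, 1000 features, k = 10 selected features and the linear SVM are all assumed choices) uses scikit-learn to compare K-fold CV with selection performed once on the pooled data against selection refit inside every training fold, on pure Gaussian noise where the true accuracy is 50%.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 40, 1000           # small n, high dimensionality
X = rng.normal(size=(n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)      # labels independent of X: chance = 0.5

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Biased: pick the 10 "best" features using ALL samples (training and
# testing data pooled), then cross-validate only the classifier.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_leaky, y, cv=cv).mean()

# Unbiased: a Pipeline refits the selection step inside each training fold,
# so the held-out fold never influences which features are chosen.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", SVC(kernel="linear"))])
unbiased = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"pooled selection: {biased:.2f}  in-fold selection: {unbiased:.2f}")
```

On noise data the pooled-selection estimate typically comes out far above chance while the in-fold estimate stays near 0.50, mirroring the bias mechanism described above.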

Keywords:

Algorithms – Autism – Gaussian noise – Kernel functions – Learning curves – Machine learning – Neuroimaging – Normal distribution



Article published in PLoS ONE, 2019, Issue 11