A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

Authors: James A. Watson aff001; Aimee R. Taylor aff003; Elizabeth A. Ashley aff002; Arjen Dondorp aff001; Caroline O. Buckee aff003; Nicholas J. White aff001; Chris C. Holmes aff006
Author affiliations: Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand aff001; Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom aff002; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA aff003; Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA aff004; Lao-Oxford-Mahosot Hospital Wellcome Trust Research Unit, Vientiane, Laos aff005; Department of Statistics, University of Oxford, Oxford, United Kingdom aff006; Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom aff007
Published in: A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices. PLoS Genet 16(10): e1009037. doi:10.1371/journal.pgen.1009037
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1009037


Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results.
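
The sensitivity of HAC to algorithmic specification can be illustrated with a minimal sketch. The code below is not the authors' pipeline: the function `hac`, the six-sample `toy_dist` matrix and the choice of two clusters are hypothetical, chosen only to show that the linkage rule alone can change cluster membership when the distance matrix is held fixed.

```python
from itertools import combinations


def hac(dist, k, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Repeatedly merges the two closest clusters until k clusters remain,
    then returns them as a set of frozensets of sample indices.
    """
    # The linkage rule turns pairwise sample distances into one
    # between-cluster distance.
    combine = {
        "single": min,
        "complete": max,
        "average": lambda d: sum(d) / len(d),
    }[linkage]
    clusters = [frozenset([i]) for i in range(len(dist))]
    while len(clusters) > k:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: combine(
                [dist[i][j] for i in pair[0] for j in pair[1]]
            ),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Hypothetical toy data (not parasite data): six samples placed on a line
# with steadily widening gaps, so that no obvious two-group split exists.
positions = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]

single = hac(toy_dist, k=2, linkage="single")
complete = hac(toy_dist, k=2, linkage="complete")

print("single linkage:  ", sorted(sorted(c) for c in single))
print("complete linkage:", sorted(sorted(c) for c in complete))
```

On this toy matrix, single linkage chains the first five samples together and isolates the last one, whereas complete linkage splits the samples four and two. Neither partition is an inference about ancestry; the difference arises purely from how between-cluster distance is defined.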

Keywords:

DNA recombination – Genetic epidemiology – Genetics – Machine learning algorithms – Malaria – Malarial parasites – Plasmodium – Population genetics



Article published in the journal

PLOS Genetics

2020, Issue 10