UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts

Authors: Alex Diaz-Papkovich aff001; Luke Anderson-Trocmé aff002; Chief Ben-Eghan aff002; Simon Gravel aff002
Author affiliations: Quantitative Life Sciences, McGill University, Montreal, Québec, Canada aff001; McGill University and Genome Quebec Innovation Centre, Montreal, Québec, Canada aff002; Department of Human Genetics, McGill University, Montreal, Quebec, Canada aff003
Published in: UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet 15(11): e1008432. doi:10.1371/journal.pgen.1008432
Category: Research Article
doi: 10.1371/journal.pgen.1008432


Human populations feature both discrete and continuous patterns of variation. Current analysis approaches struggle to jointly identify these patterns because of modelling assumptions, mathematical constraints, or numerical challenges. Here we apply uniform manifold approximation and projection (UMAP), a non-linear dimension reduction tool, to three well-studied genotype datasets and discover overlooked subpopulations within the American Hispanic population, fine-scale relationships between geography, genotypes, and phenotypes in the UK population, and cryptic structure in the Thousand Genomes Project data. This approach is well-suited to the influx of large and diverse data and opens new lines of inquiry in population-scale datasets.
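
As a rough illustration of what applying UMAP to genotype data can look like in practice, the sketch below reduces a matrix of allele counts to its leading principal components and then passes those to UMAP. The abstract does not specify pre-processing or parameters, so everything here is an assumption made for illustration: the PCA step, the function names, the simulated two-population data, and the values of `n_pcs`, `n_neighbors`, and `min_dist`. The UMAP call relies on the third-party `umap-learn` package.

```python
import numpy as np


def top_principal_components(genotypes, n_pcs=10):
    """Project individuals onto the leading principal components.

    genotypes: (n_individuals, n_snps) array of allele counts coded 0/1/2.
    Returns an (n_individuals, n_pcs) array of PC scores.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)  # center each SNP column
    # Thin SVD: the columns of u scaled by s are the PC scores.
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def umap_embedding(features, n_neighbors=15, min_dist=0.5, seed=0):
    """Embed the rows of `features` in 2D with umap-learn (third-party)."""
    import umap  # installed with: pip install umap-learn

    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=n_neighbors,  # illustrative value, not from the paper
        min_dist=min_dist,        # illustrative value, not from the paper
        random_state=seed,
    )
    return reducer.fit_transform(features)


def simulate_two_populations(n_per_pop=100, n_snps=500, seed=0):
    """Toy data: two groups whose allele frequencies differ slightly."""
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])
```

With `umap-learn` installed, `umap_embedding(top_principal_components(simulate_two_populations()))` returns one 2D coordinate per simulated individual, which can then be plotted and coloured by population label. Feeding principal components rather than raw genotypes to UMAP is a design choice of this sketch that keeps the neighbour search cheap on large cohorts.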

Keywords:

African people – Caribbean – Chinese people – Data visualization – Ethnicities – Europe – Hispanic people – Principal component analysis




Published in PLOS Genetics, 2019, Issue 11
