A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank


Authors: Junyang Qian [1]; Yosuke Tanigawa [2]; Wenfei Du [1]; Matthew Aguirre [2]; Chris Chang [3]; Robert Tibshirani [1]; Manuel A. Rivas [2]; Trevor Hastie [1]
Author affiliations: [1] Department of Statistics, Stanford University, Stanford, CA, United States of America; [2] Department of Biomedical Data Science, Stanford University, Stanford, CA, United States of America; [3] Grail, Inc., Menlo Park, CA, United States of America
Published in: PLoS Genet 16(10): e1009141. doi:10.1371/journal.pgen.1009141
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1009141

Abstract

The UK Biobank is a very large, prospective, population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso has proven, since it was first proposed in statistics, to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimensionality of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ₁-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with ℓ₁/ℓ₂ penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
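
The idea of wrapping an existing lasso solver in a batch screening loop can be illustrated with a minimal sketch. The abstract does not spell out the screening or stopping rules, so everything below is an assumption made for illustration: a single penalty value `lam`, screening by the absolute correlation of each variable with the current residual, a stopping check based on the standard lasso optimality (KKT) conditions, and a small pure-Python coordinate descent routine standing in for the "existing solver". The names `lasso_cd` and `batch_screening_lasso` are hypothetical and are not part of snpnet or glmnet.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(cols, y, lam, n_sweeps=500):
    """Stand-in inner solver (assumption): coordinate descent for
    (1/2n)||y - X b||^2 + lam * ||b||_1, restricted to the columns in
    `cols` (dict: variable index -> column). Returns (coefficients, residual)."""
    n = len(y)
    beta = {j: 0.0 for j in cols}
    resid = list(y)
    for _ in range(n_sweeps):
        for j, col in cols.items():
            scale = dot(col, col) / n
            if scale == 0.0:
                continue
            rho = dot(col, resid) / n + scale * beta[j]
            new = soft_threshold(rho, lam) / scale
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta, resid


def batch_screening_lasso(columns, y, lam, batch_size=2, tol=1e-9):
    """Illustrative batch screening loop for one penalty value.

    Repeats: (1) score every variable against the current residual,
    (2) stop if no variable outside the working set violates the lasso
    optimality condition |x_j . r| / n <= lam, otherwise (3) add the
    top-scoring violators and refit on the working set only."""
    n, p = len(y), len(columns)
    active = set()
    beta, resid = {}, list(y)
    while True:
        # The only step that touches every column of the data.
        score = [abs(dot(columns[j], resid)) / n for j in range(p)]
        violators = sorted(
            (j for j in range(p) if j not in active and score[j] > lam + tol),
            key=lambda j: -score[j],
        )
        if not violators:
            return [beta.get(j, 0.0) for j in range(p)]
        active.update(violators[:batch_size])
        # The inner solver only ever sees the (small) working set.
        beta, resid = lasso_cd({j: columns[j] for j in active}, y, lam)


# Tiny synthetic demo: 2 informative variables out of 10.
random.seed(0)
n_obs, n_var = 60, 10
X = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_var)]
y_demo = [2.0 * X[0][i] - 1.5 * X[3][i] + 0.1 * random.gauss(0, 1)
          for i in range(n_obs)]
coef = batch_screening_lasso(X, y_demo, lam=0.3)
print([j for j, b in enumerate(coef) if b != 0.0])
```

In this sketch the inner solver is called only on the working set, so the full data are needed only for the scoring pass, which is the part that could be streamed from disk when the data exceed memory. Because the loop stops only when no excluded variable violates the optimality condition, the returned coefficients solve the full problem up to the inner solver's convergence tolerance.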

Keywords:

Algorithms – Asthma – Body Mass Index – Genetics – Genome-wide association studies – Heredity – Hypercholesterolemia – Single nucleotide polymorphisms

