Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Authors: Sahir R. Bhatnagar aff001; Yi Yang aff003; Tianyuan Lu aff004; Erwin Schurr aff006; JC Loredo-Osti aff007; Marie Forest aff008; Karim Oualkacha aff009; Celia M. T. Greenwood aff001
Author affiliations: Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada aff001; Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada aff002; Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada aff003; Quantitative Life Sciences, McGill University, Montreal, Québec, Canada aff004; Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada aff005; Department of Medicine, McGill University, Montréal, Québec, Canada aff006; Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada aff007; École de Technologie Supérieure, Montréal, Québec, Canada aff008; Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada aff009; Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada aff010; Department of Human Genetics, McGill University, Montreal, Quebec, Canada aff011
Published in: Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 16(5): e1008766. doi:10.1371/journal.pgen.1008766
Category: Research Article
doi: 10.1371/journal.pgen.1008766


Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMM) can account for correlations due to relatedness, but they have not been applicable in high-dimensional (HD) settings, where the number of fixed effect predictors greatly exceeds the number of samples. Two-stage approaches, in which the residuals estimated from a null model adjusted for the subjects’ relationship structure are subsequently used as the response in a standard penalized regression model, can produce false positives or false negatives. To overcome these challenges, we develop ggmix, a general penalized LMM with a single random effect, for simultaneous SNP selection and adjustment for population structure in high-dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection that is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix yields more parsimonious models with better prediction accuracy than either the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are implemented in an R package available on CRAN (

Keywords:

Algorithms – Covariance – Genetic loci – Genome-wide association studies – Mathematical models – Molecular genetics – Simulation and modeling – Variant genotypes
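
To make the idea of a penalized LMM with a single random effect more concrete, the following is a minimal Python sketch, not the ggmix R package and not the authors' algorithm. It assumes the model y = Xβ + b + e with Cov(b + e) proportional to ηK + (1 − η)I, where K is a kinship matrix and η is a variance-share parameter. Rotating the data by the eigenvectors of K makes the errors uncorrelated, so the fixed effects can be fit with a weighted lasso by coordinate descent. Unlike the blockwise algorithm described in the abstract, this sketch holds η fixed and takes the penalty λ as given; the function names, the parameterization, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def rotated_weighted_lasso(X, y, kinship, eta, lam, n_iter=200):
    """Weighted lasso on eigen-rotated data (illustrative, eta held fixed).

    Minimizes (1 / 2n) * sum_i w_i * (y~_i - x~_i' beta)^2 + lam * ||beta||_1,
    where y~ = U'y, X~ = U'X, K = U diag(d) U', and
    w_i = 1 / (eta * d_i + 1 - eta). No intercept is fitted.
    """
    d, U = np.linalg.eigh(kinship)
    d = np.clip(d, 0.0, None)            # guard against tiny negative eigenvalues
    Xr, yr = U.T @ X, U.T @ y            # rotation decorrelates the errors
    w = 1.0 / (eta * d + (1.0 - eta))    # precision weights, positive for eta < 1
    n, p = X.shape
    beta = np.zeros(p)
    resid = yr.copy()
    scale = (w[:, None] * Xr ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += Xr[:, j] * beta[j]              # remove coordinate j from the fit
            rho = (w * Xr[:, j] * resid).sum() / n   # weighted partial correlation
            beta[j] = soft_threshold(rho, lam) / scale[j]
            resid -= Xr[:, j] * beta[j]              # restore the fit with the new value
    return beta


# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(0)
n, p = 60, 5
G = rng.standard_normal((n, 200))
kinship = G @ G.T / 200                  # a positive semi-definite stand-in for kinship
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.standard_normal(n)
beta_hat = rotated_weighted_lasso(X, y, kinship, eta=0.5, lam=0.1)
```

With λ = 0 the sketch reduces to generalized least squares under the assumed covariance, which is a convenient sanity check. In practice η, the residual variance, and λ would all need to be estimated rather than supplied.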


