Effectively Identifying eQTLs from Multiple Tissues by Combining Mixed Model and Metaanalytic Approaches
Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of metaanalysis which aggregates results of multiple genomewide association studies to improve the power to detect associations. In principle, metaanalysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular metaanalysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.
Published in the journal:
. PLoS Genet 9(6): e32767. doi:10.1371/journal.pgen.1003491
Category:
Research Article
doi: 10.1371/journal.pgen.1003491
Summary
Gene expression data, in conjunction with information on genetic variants, have enabled studies to identify expression quantitative trait loci (eQTLs) or polymorphic locations in the genome that are associated with expression levels. Moreover, recent technological developments and cost decreases have further enabled studies to collect expression data in multiple tissues. One advantage of multiple tissue datasets is that studies can combine results from different tissues to identify eQTLs more accurately than examining each tissue separately. The idea of aggregating results of multiple tissues is closely related to the idea of metaanalysis which aggregates results of multiple genomewide association studies to improve the power to detect associations. In principle, metaanalysis methods can be used to combine results from multiple tissues. However, eQTLs may have effects in only a single tissue, in all tissues, or in a subset of tissues with possibly different effect sizes. This heterogeneity in terms of effects across multiple tissues presents a key challenge to detect eQTLs. In this paper, we develop a framework that leverages two popular metaanalysis methods that address effect size heterogeneity to detect eQTLs across multiple tissues. We show by using simulations and multiple tissue data from mouse that our approach detects many eQTLs undetected by traditional eQTL methods. Additionally, our method provides an interpretation framework that accurately predicts whether an eQTL has an effect in a particular tissue.
Introduction
Advances in genotyping and gene expression technologies have enabled researchers to study associations between genetic variants and gene expression levels. These studies often treat expression levels as quantitative traits and apply statistical tests to identify genomic locations known as expression Quantitative Trait Loci (eQTLs) that segregate the traits. Genomewide maps of eQTLs for several organisms including budding yeast [1], [2], Arabidopsis [3], mouse [4], [5] and human [6], [7] have been successfully generated. Furthermore, recent technological developments and cost decreases in microarrays allow studies to collect expression data in more than one tissue in human [6], [8], [9] and mouse [4], [5]. A collection of expression data from multiple tissues enables studies to explore the tissuespecific nature of eQTLs as well as their global effects on different types of tissues.
Multiple tissue datasets can potentially allow studies to more effectively identify eQTLs by combining information from multiple tissues. Due to a limited sample size, a standard single tissue eQTL method or “tissuebytissue” approach that examines each tissue individually may not detect an eQTL in any one tissue, or it may overestimate the proportion of tissue specific eQTLs [10]. However, if a genetic variant is associated with the expression of a gene in more than one tissue, we can aggregate information from multiple tissues to increase statistical power. This idea is similar to the idea of metaanalysis in genomewide association studies (GWAS) that combines results of several studies on the same phenotype. In our case, each tissue is considered as a separate “study” in the metaanalysis.
One key difficulty in combining results from multiple tissues is that it is not known in which tissues a genetic variant has an effect. For example, a variant may influence gene expression in all tissues, may have different effects on different tissues, or may have an effect in some tissues but may not have any effect in other tissues. This phenomenon, different effect sizes among tissues, is called heterogeneity. Metaanalysis methods have different assumptions on the distribution of effect sizes, and to better detect eQTLs, studies will perform best if they apply a metaanalysis method whose assumptions are consistent with the actual effect sizes of eQTLs in multiple tissues. For instance, if an eQTL has an effect in all tissues, studies would perform best if they utilize the fixedeffects model (FE) [11]–[13] that assumes no heterogeneity. On the other hand, to effectively detect an eQTL whose effects on gene expression differ across tissues, studies will perform best if they apply the randomeffects model (RE) [14]–[18] that considers heterogeneity.
Another challenge in applying metaanalysis to multitissue datasets is that studies often collect multiple tissues from the same individuals, which may cause the expression between tissues of the same individual to be correlated. This correlation may cause false positives for standard metaanalysis methods which assume a disjoint set of individuals in each study.
In this paper, we present a novel approach called “MetaTissue” that identifies eQTLs from multiple tissues by utilizing metaanalysis. The critical advance of our methodology is that we extend metaanalysis to a mixed model framework. We apply the mixed model to account for the correlation of expression between tissues, and perform metaanalysis to combine results from multiple tissues. Since we do not know in advance the distribution of effect sizes for eQTLs among different tissues, we utilize both the FE and RE models to identify as many eQTLs as possible, and for RE, we use a recently developed randomeffects model [18] that achieves higher statistical power than the traditional randomeffects model. We first show by simulations that MetaTissue is more powerful than the tissuebytissue approach in detecting eQTLs when eQTLs have effects in multiple tissues, while controlling for the false positive rate correctly.
We then apply MetaTissue to a mouse expression dataset. This dataset is ideal for evaluating methods for discovering eQTLs for several reasons. The data are generated through a cross which limits the genetic diversity in the dataset, and all variants have similar frequencies which eliminate effects of allele frequency on power. In addition, the dataset contains gene expression from many different tissues and different numbers of individuals for the tissues, allowing us to compare results between different scenarios. We analyze four tissues from 50 samples per each tissue and ten tissues from 22 samples. We apply MetaTissue to both datasets and demonstrate that MetaTissue detects many eQTLs that are undetected by the tissuebytissue method.
In addition to accurately detecting eQTLs from multiple tissues, our method can also predict whether an eQTL affects or does not affect expression in a specific tissue. Predicting the existence or absence of an effect is a very difficult problem in metaanalysis, and it is known that making predictions based on pvalues is not effective [19]. One of the reasons is that a nonsignificant pvalue is not necessarily evidence of an absence of an effect since the study may be underpowered. Our method instead computes the posterior probability of the presence or absence of an effect for each study building on recent work in interpretation of metaanalysis [19]. Applying the framework to the four and ten tissue datasets, we identify more eQTLs that are predicted to have effects in all tissues compared to the pvalue based approach, which are interesting potential candidates with possible global regulatory mechanisms. MetaTissue is publicly available at http://genetics.cs.ucla.edu/metatissue/.
Results
MetaTissue
The main idea of MetaTissue is that it combines the effect size estimates from multiple tissues using a “metaanalysis” approach. Metaanalysis techniques are widely applied to combine the results of GWAS studies. In our case, we consider each tissues as a “study.” This has the advantage of increasing the statistical power to detect eQTLs shared across tissues. There are several challenges corresponding to the inherent differences between combining GWAS studies and expression quantitative trait loci studies in multiple tissues. The first challenge is that we expect that there may be differences in effect sizes between tissues. For this reason, we utilize both the randomeffects model which allows MetaTissue to detect eQTLs when heterogeneity is present, and the fixedeffects model when it is not. A second challenge is that in many multitissue eQTL study designs, multiple tissues are collected from the same individuals which induce correlation between measurements of expression levels in different tissues. However, metaanalysis methods assume that studies are independent and may be susceptible to false positives. To overcome this challenge, we utilize the linear mixed model to correct our effect size estimates before performing the metaanalysis.
We assume that multitissue eQTL studies collect expression values of genes from individuals in tissues. However, those individuals are not necessarily the same for all tissues; some individuals may provide a subset of tissues. The studies also collect genotype information of SNPs from the individuals. To determine eQTLs in a specific tissue, or pairs of SNP and gene that are significantly correlated, eQTL studies often use the following linear model.
where is gene expression of individuals in tissue , is information on SNP , and is a vector of ones. is the effect size of SNP on gene in tissue , and if it is not zero, we claim the pair of SNP and gene as an eQTL. The TissueByTissue (TBT) approach computes for every tissue (), and determines whether at least one is not zero.
To increase the statistical power to detect eQTLs, MetaTissue utilizes metaanalysis that combines from tissues. A naive approach to apply metaanalysis to multitissue eQTL datasets is directly using computed from the linear model for TBT. This approach, however, violates the main assumption of metaanalysis that is independent for tissues. Because multiple tissues are often collected from the same individuals, there exists correlation between gene expression values across different tissues, and this leads to correlated .
To apply metaanalysis to correlated , MetaTissue uses a linear mixed model to explicitly capture correlation between :
where and contain gene expression and SNP information in all tissues, and Figure 1 shows how they are encoded using a simple example. is the random effect of a mixed model due to the fact that multiple tissues are collected from the same individuals. follows the multivariate normal distribution whose covariance matrix ( matrix in Figure 1) represents sharing of individuals in multiple tissues. MetaTissue applies the generalized least squares to estimate and its covariance or correlation between . MetaTissue “uncorrelates” using the covariance it estimated and use the “uncorrelated” for metaanalysis (see the Materials and Methods section for more details).
There is a fundamental difference between MetaTissue and the TBT approach. The statistical test in MetaTissue tests whether or not a gene is involved in an eQTL in any of the tissues. In other words, the null hypothesis of MetaTissue assumes that no effect is present in any of the tissues for a specific gene. A rejection of this null hypothesis is effectively predicting the presence of an effect in at least one of the tissues. However, the tissuebytissue approach tests whether or not an eQTL is present in each tissue. Hence, the null hypothesis of TBT assumes that no effect is present in a specific tissue. This means that MetaTissue performs one test per gene and TBT performs one test per gene in each tissue. In our comparisons of MetaTissue and TBT, we adjust the significant thresholds so that the overall false positive rate of implicating any tissue of a gene in an eQTL is constant for both methods.
Once we identify a significant association using MetaTissue, this means that at least one of the tissues contains an eQTL. In order to identify which subset of the tissues contain an eQTL, we utilize a recently developed metaanalysis interpretation framework which computes an mvalue statistic for each tissue [19]. The mvalue estimates the posterior probability that an effect is present in a study included in a metaanalysis. Utilizing the mvalues, we can predict tissues in which an effect is present.
Power comparison by simulation
We first simulate gene expression data to compare the power between the traditional TissueByTissue approach (TBT), MetaTissue FE, and MetaTissue RE. We create a dataset that has 100 individuals with one SNP and one gene expression level simulating one eQTL. We set the minor allele frequency to 30%. We simulate four tissues and consider four scenarios where a SNP has the same effect in (1) a single tissue, (2) in two tissues, (3) in three tissues, and (4) in all four tissues. The first three scenarios correspond to eQTLs with heterogeneity while eQTLs have no heterogeneity in the last scenario. We check statistics [20] of eQTLs that measure the magnitude of heterogeneity in each scenario and verify that eQTLs have high levels of heterogeneity in the first three scenarios, but very low levels in the last scenario (Figure S1). We assume that each individual provides four tissues, and hence this simulation corresponds to a repeated measures design. We use the mixed model discussed in the Materials and Methods section to generate the gene expression levels of individuals while taking into account the repeated measures design. We generate 1,000 datasets (each a potential eQTL) and the power is estimated as a proportion of eQTLs detected at a significance threshold of for metaanalysis methods. We choose this threshold because the number of tests we perform in mouse datasets is on the order of one million (135 SNPs10,588 genes). The significance threshold adjusted for one million tests as in typical GWAS is . For TBT, we apply a significance threshold of such that the overall false positive rate of TBT is the same as that for MetaTissue as discussed in the previous section.
To apply the proposed methods to the simulations, we use the following approach. For TBT, we perform a standard Ftest using a linear model to obtain a pvalue for each pair of a SNP and a gene expression level in each tissue (see Materials and Methods). The tissuebytissue approach declares a SNPgene expression pair as an eQTL if the pvalue for the association statistic is below the threshold for any one of the tissues. For MetaTissue, we first perform generalized least squares (GLS) to correct for the fact that individuals are shared among tissues. MetaTissue then combines information from multiple tissues to obtain either fixed effect or random effect metaanalysis pvalues as described in the Materials and Methods section. A SNPexpression pair is considered as an eQTL if its metaanalysis pvalue is below the significance threshold. As a separate simulation, we verify that both of our implementations (MetaTissue FE and RE) control the false positive rates (Text S1). This simulation also shows that utilizing the mixed model is critical for controlling false positives when expression levels from multiple tissues are collected from the same individual.
Figure 2 shows that MetaTissue methods are more powerful than TBT when effects exist in multiple tissues; MetaTissue RE is the most powerful when an eQTL has effects in two or three tissues, and MetaTissue FE outperforms TBT and MetaTissue RE when the effects exist in all tissues. The TBT approach has higher power than MetaTissue methods when the effects exist in a single tissue. These results show that TBT is an ideal approach to detect an eQTL that is specific to a certain tissue while MetaTissue approaches are ideal for detecting an eQTL that has effects in more than one tissue. As the number of tissues with effects increases, the power of MetaTissue methods increases while that of TBT decreases. These results suggest an integrated approach in eQTL studies to apply TBT for detecting tissuespecific eQTLs and MetaTissue methods for detecting eQTLs shared between tissues.
Simulation of heterogeneity in multiple tissues using mouse data
To verify the results of the previous power simulation in real multiple tissue data, we simulate heterogeneity using a liver tissue expression from mouse. This dataset contains 108 samples, 135 SNPs and 10,588 probe expression levels. We detect 389 eQTLs in this single tissue dataset using the standard linear model with a pvalue threshold of , which corresponds to the false discovery rate (FDR) of 0.017% level. We consider these detected associations as the gold standard for measuring accuracy of methods in this simulation. We then split the 108 samples into three groups of 36 samples to simulate three tissues, and this means that eQTLs have effects in all three tissues. In our simulations, we expect to find fewer eQTLs because each of our “tissues” only has 36 samples compared to the original 108 samples. We then consider three scenarios similar to scenarios in the previous power simulation; (1) eQTLs have effects only in the first tissue by permuting expression of the second and third tissues, (2) eQTLs have effects only in the first and second tissues by permuting expression of the third tissue, and (3) eQTLs have effects in all three tissues without any permutation. Permuting the expression of a specific tissue removes effects of eQTLs from the tissue, and hence allows simulation of heterogeneity. We apply MetaTissue FE, MetaTissue RE, and TBT to this multiple tissue dataset and measure how many eQTLs out of the original 389 eQTLs each method can recover using the same threshold ( for TBT). Because the number of eQTLs methods recover can change depending on how we split the 108 samples, we perform ten iterations of the experiment where we divide individuals differently in each iteration, and average the results.
The result of this simulation shows that MetaTissue methods recover the most eQTLs when eQTLs have effects in more than one tissue (Figure 3). When effects exist in two out of three tissues, MetaTissue RE recovers the most eQTLs; it recovers 144 eQTLs out of the 389 eQTLs on average, and this is 27% and 133% more than the number of eQTLs MetaTissue FE and TBT recover, respectively. When eQTLs have effects in all tissues, MetaTissue FE recovers the most eQTLs, and when effects exist in a single tissue, TBT does. This result is consistent with the previous power simulation in which MetaTissue methods were more powerful than TBT when eQTLs have effects in multiple tissues.
Detecting eQTLs in multiple tissue mouse data
We apply MetaTissue to detect eQTLs in multiple tissues from mouse. Our data consists of two sets; one with four tissues (cortex, heart, liver, spleen), and the other with ten tissues (bone marrow, hippocampus, kidney, pancreas, stomach, white fat, and the four tissues). The four tissue dataset has 50 samples per each tissue while the ten tissue dataset has 22 samples per tissue. In both datasets, not all individuals provided all different types of tissues; on average, 34% of individuals are shared between two tissues in the four tissue dataset while 11% of individuals are shared in the ten tissues dataset. The number of SNPs (135 SNPs) and the number of probes (10,588) are the same as those of the liver tissue.
Figures 4A (four tissues) and 4B (ten tissues) show the number of eQTLs detected by MetaTissue RE, MetaTissue FE, and TBT using a threshold of (/the number of tissues for TBT). The number substantially increases by using MetaTissue RE or FE, showing up to two fold and twelve fold increases compared to TBT in the four and ten tissue datasets, respectively. These results indicate that methods that combine results of multiple tissues outperform a method that uses results of each tissue separately as all metaanalysis methods detect more eQTLs than TBT. Moreover, these results suggest a possibility that there exist a considerable number of eQTLs with different effect sizes across tissues as MetaTissue RE consistently identifies more eQTLs than MetaTissue FE. In addition to the number of eQTLs (SNPexpression pairs), we also analyze the number of eSNPs (unique SNPs influencing gene expression) and eProbes (unique probes for gene expression). Similar to the results of the number of eQTLs, MetaTissue detects more eSNPs and eProbes than TBT (Figure 5).
Another important implication comes from comparing the two datasets. TBT finds substantially fewer number of eQTLs in the ten tissue dataset than in the four tissue dataset. This is possibly because the sample size of each tissue is decreased from 50 to 22. On the other hands, the metaanalytic methods find more eQTLs. One possible reason is that the total sample size is slightly increased from 200 to 220. Therefore, the results demonstrate that by using information from multiple tissues and leveraging metaanalysis methods, we may be able to detect eQTLs even if the sample size for each tissue is small.
In addition to the number of eQTLs that different methods detect, we also analyze the overlap of eQTLs using Venn diagrams (Figures 4C and 4D). The Venn diagrams show the number of eQTLs detected only by each of the three methods, by both TBT and each of MetaTissue methods, by both MetaTissue methods, and by all three methods. In the four tissue dataset, the three methods detect 493 unique eQTLs overall, and a majority of eQTLs (95.1% of total eQTL) are detected by either of MetaTissue methods. There are, however, 24 eQTLs (4.9% of total eQTLs) that only TBT detects, and they are likely to be tissuespecific eQTLs. In the ten tissue dataset, almost all eQTLs (99.3% of total eQTLs) are detected by MetaTissue RE or FE, and there are 4 eQTLs (0.7% of total eQTLs) detected only by TBT, which may be due to the low statistical power due to the limited number of samples.
Instead of the common genomewide significance threshold (e.g. ) to identify eQTLs, an alternative approach is to use the false discovery rate (FDR) approach, and we use the QVALUE package in R [21] to compute a qvalue for each SNPexpression pair. We consider only ciseQTLs for the FDR approach; we consider an eQTL as cis if a SNP is on the same chromosome as the probe for gene expression. While typical eQTL studies consider 1 Mb as a distance between a SNP and a probe for ciseQTLs, we consider a much longer distance due to a small number of genotyped SNPs (135 SNPs). Figures S2A and S2B show the number of eQTLs detected by MetaTissue methods and TBT using FDR of 0.05 level in four and ten tissues, respectively, and Figures S2C and S2D are Venn digrams showing the overlap of eQTLs. The results using the FDR approach are consistent with those using the common genomewide significance threshold; MetaTissue RE detects most eQTLs among the three methods, and a majority of eQTLs (86% and 93% of total eQTLs for four and ten tissues) are detected either by MetaTissue RE or FE.
Measuring heterogeneity in mouse data
The number of eQTLs detected only by TBT or by RE in Figures 4 and S2 indicates that there can be several eQTLs with different effect sizes in different tissues. To measure the magnitude of heterogeneity of eQTLs, we use the Cochran's Q statistic [14] and the statistic [20]. We make a plot whose xaxis is the statistic and whose yaxis is the log of pvalue of Cochran's Q statistic, and a histogram showing the distribution of statistics. Figures S3, S4, and S5 show the heterogeneity of eQTLs detected by TBT, FE, and RE, respectively, in the four tissues of mouse data. These plots show that the eQTLs detected by RE show higher level of heterogeneity than the eQTLs detected by FE, as expected. Given the pvalue threshold of where is the number of eQTLs detected, 65, 17, and 53 eQTLs show statistically significant heterogeneity in TBT, MetaTissue FE, and MetaTissue RE, respectively, using the pvalue of Cochran's Q statistic.
Predicting the presence of effects in multiple tissue data
Our MetaTissue approach not only detects more eQTLs from multiple tissues but also provides an interpretation framework that predicts whether an eQTL has effects in a specific tissue. MetaTissue computes a statistic called mvalue [19], and it is the posterior probability that an effect exists in a specific tissue. If the mvalue is greater than a threshold , we predict that an effect exists, and if it is less than , we predict that an effect does not exist. Another approach to predict an effect is to use a pvalue. In this approach, an effect exists if a pvalue is less than a significance threshold and does not exist otherwise.
We first apply this prediction framework to the 3way split liver tissue dataset that we previously generated. Recall that the liver tissue has 389 eQTLs, and we simulated three tissues from it and three scenarios in which we varied heterogeneity of eQTLs. For this simulation, we consider only the scenario where eQTLs have effects in the first two tissues out of three since this corresponds to heterogeneity in which the number of eQTLs that TBT and MetaTissue recover is relatively large. We measure how accurately MetaTissue and the pvalue approach predict the presence and absence of effects of the 389 eQTLs in the three tissues. More specifically, MetaTissue makes a correct prediction if mvalues are greater than 0.9 in the first two tissues and the mvalue is less than 0.1 in the third tissue (). We consider an mvalue prediction to be ambiguous if any of the three tissues has the mvalue between 0.1 and 0.9. If the prediction is not either correct or ambiguous, it is considered as an incorrect prediction. For the pvalue approach, pvalues of the first two tissues need to be less than the significance threshold (/3) and pvalue of the third tissue needs to be greater than the threshold for a correct prediction. Otherwise, the prediction is an incorrect prediction since the pvalue approach does not have the notion of the ambiguous prediction. In the original 3way split liver tissue experiment, we had ten simulations which differed in how the individuals were divided. Over the ten simulations, MetaTissue and TBT recovered 146 eQTLs out of total 389 eQTLs on average (Figure 3). Since we use mvalues for the interpretation purpose (not for detecting eQTLs), we apply mvalues to only those 146 eQTLs. We also predict effects of the 146 eQTLs using the pvalue approach.
MetaTissue makes the correct prediction for 35% (51/146) of the eQTLs and predicts the ambiguous prediction for 56% (82/146). The pvalue approach only makes the correct prediction for 11% (16/146) of the eQTLs. The number of correct predictions of MetaTissue is more than three times greater. In addition, given the advantage of the fact that MetaTissue can make ambiguous predictions, the number of incorrect predictions for MetaTissue (13/146) is ten times fewer than that for the pvalue approach (130/146). The results demonstrate that by combining the metaanalysis method and the interpretation framework, we may predict effects of eQTLs more accurately than the approach utilizing pvalues.
We then apply our interpretation framework to the four and ten multiple tissue datasets from mouse to predict effects of eQTLs that were discovered using MetaTissue and TBT (493 and 568 eQTLs in four and ten tissue datasets, respectively). We calculate the mvalue for each eQTL per each tissue and make a prediction that the eQTL affects expression in that tissue if the mvalue is greater than 0.9. We also compare our approach to the pvalue approach as in the previous simulation using the same threshold (/the number of tissues).
First, we apply the two approaches to the four tissue dataset, and Table 1 lists the number of eQTLs predicted to have effects across various combinations of tissues (e.g. eQTLs affecting expression in heart/liver, heart/cortex, heart/liver/cortex). The results show that MetaTissue consistently categorizes more eQTLs having effects in multiple tissues than the pvalue approach. Among those eQTLs, ones that influence expression levels in all tissues are particularly interesting because they may provide insights into the global regulatory mechanisms of eQTLs. MetaTissue predicts 283 such eQTLs while the pvalue approach predicts 15 eQTLs. The small number of predictions in pvalue approach is expected because even if the effect exists in all tissues, given power of tissuebytissue approach, we can predict the global effect only with probability .
We next predict effects of eQTLs in the ten tissue dataset, and for this dataset, we would expect to detect a fewer number of eQTLs having effects across all tissues since it becomes less likely that all pvalues or mvalues pass the threshold as we try to detect effects in more tissues. Table 2 shows the number of eQTLs predicted to affect expression across different numbers of tissues considered (e.g. eQTLs having effects across any two tissues, any three tissues). Similar to the results of the four tissue dataset, MetaTissue predicts more eQTLs with effects in several tissues than the pvalue approach. Unlike the four tissues, we detect a fewer number of eQTLs having effects in all ten tissues; 134 and zero such eQTLs by MetaTissue and the pvalue approach, respectively. The results indicate the intrinsic difficulty in detecting eQTLs influencing expression across many different tissues.
Discussion
We presented a statistically powerful approach to detect eQTLs from multiple tissues. Our approach, MetaTissue, takes advantage of two metaanalysis methods that differ in their assumptions on effects of eQTLs in different tissues. The first method assumes that effects exist in all tissues with the same magnitude, and this assumption allows us to detect eQTLs shared across all tissues. The second method assumes that effect sizes of variants are different among studies. By assuming the heterogeneity, we may be able to accurately describe the nature of eQTLs whose patterns of genetic regulation differ across tissues. Metaanalysis methods, however, assume that studies are independent, and this assumption is unlikely to be true in multitissue dataset since studies collect multiple tissues from the same individuals. This may cause correlation in expression between tissues, and to correct for the correlation, we utilized a mixed model that enables the metaanalysis method to achieve correct false positive rates.
To measure the performance of MetaTissue, we first showed by simulations that our methods are generally more powerful than a naive approach that looks at results of each tissue individually. Next, by using data from mouse liver tissue, we simulated the heterogeneity in effect sizes across a subset of tissues as well as in all tissues. MetaTissue methods were shown to recover more original eQTLs from multiple tissues than the naive tissuebytissue approach when effects exist in multiple tissues. We then observed that MetaTissue detects many eQTLs that the naive approach does not detect in four and ten tissue datasets from mouse. However, we note that there are a few tissuespecific eQTLs that only the naive approach detects, and hence we recommend that eQTL studies also apply the naive approach in addition to MetaTissue.
In addition to detecting more eQTLs, MetaTissue can also accurately predict whether an effect exists in a specific tissue. MetaTissue calculates the posterior probability that an eQTL has an effect in a certain tissue, and we demonstrated that this probability is more effective in predicting the effect than a pvalue is by using the same liver tissue simulation. We then predicted effects of eQTLs that we found in the four and ten tissue datasets and showed our method predicts more eQTLs having effects in multiple tissues than the pvalue approach.
Our approach is fundamentally different from previous approaches that also attempt to detect eQTLs from multiple tissues, and to the best of our knowledge, MetaTissue is the first method to apply both a mixed model and metaanalysis methods to eQTL mapping. A traditional approach to detect associations from repeated measurements from same individuals such as multiple tissue data is MANOVA. However, MANOVA is not directly applicable to our multiple tissue data because not all samples provided all different types of tissues, and hence our data are not completely “repeated measurements.” MetaTissue is more general than MANOVA since MetaTissue can be applied to both “repeated measures design” in which individuals are shared across all tissues and to a scenario in which only a subset of individuals are shared. Another advantage of our method is that MetaTissue can take into account population structure by adding an additional variance component term in our mixed model. This may be important to multiple tissue datasets in which individuals are sampled from different populations, which may cause inflation of false positives.
MetaTissue leverages the recently developed random effects model [18] that achieves higher power than the traditional random effects model [14]–[17]. Han and Eskin showed that the traditional random effects model never achieves higher power than the fixed effects model due to its conservative null hypothesis. We apply the traditional RE to our power simulation (Figure S6), the heterogeneity experiment with the liver tissue (Figure S7), and the four and ten tissue datasets of mouse data (Figure S8), and we observe the same phenomenon; the traditional RE is always less powerful than FE and the recently developed RE.
There are a few other methods that attempt to detect eQTLs from the multiple tissue data such as Sparse Bayesian Multiple Regression and the GFlasso approach proposed by Petretto et al. [22] and Kim et al. [23] However, a key difference between these methods and MetaTissue is that they attempt to detect multiple variants (“multilocus”) associated with multiple traits while our method focuses on an association of a single variant. Another difference and one main advantage of MetaTissue is that since it is a metaanalysis method, studies can combine results of many published eQTL analyses without actual data assuming that those analyses are independent; only results of an eQTL analysis such as effect size estimates are needed when the analyses are independent. MetaTissue has another advantage that it is simpler and more computationally efficient than other methods that involve computationally challenging algorithms such as Bayesian variable selection and regularized linear regression including Lasso. While we applied MetaTissue to the multitissue dataset with a small number of genotyped SNPs and samples (135 SNPs and about a total of 200 samples across tissues), our algorithm and software are efficient enough to be applied to larger eQTL studies where there are hundreds of individuals genotyped at hundreds of thousands SNPs.
Materials and Methods
Mouse strains
F1N2 mice from a C57BL6/N×129/OlaHsd cross were produced as follows. Male ES cell chimeric founders (E14 ES line [24]) were crossed to C57BL6/N females (Harlan Laboratories). Male agouti offspring were backcrossed to C57BL6/N females, and F1N1 offspring were intercrossed to produce F1N2 animals, Figure 6. All animals were maintained in ventilated microisolator caging (Allentown), fed a standard lab chow diet (Harlan Teklad) and provided water ad libidem. F1N2 animals were group housed with littermates until 9 weeks of age. Mice selected for tissue harvest were singly housed for one additional week, to minimize socialization effects. Only males were used, to avoid estrus related effects on gene expression. While the production crosses segregated various gene targeted alleles, all mice selected for this study carried only wild type genomes and did not carry any engineered genomic alterations such as gene knockouts.
Gene expression
Animals were sacrificed by cervical dislocation and immediately dissected. A set of thirty tissues were collected from each animal in a prescribed order, beginning with the pancreas. Each tissue was briefly rinsed in PBS and deposited in RNAlater (Ambion), held at room temperature to allow diffusion of RNAlater into the tissue, and then stored at −86C.
Tissue homogenization, total RNA isolation, cDNA production, in vitro transcription and fluorescent labeling were performed as per Affymetrix gene chip recommended protocols. The hybridization mixes were analyzed using Affymetrix U74Av2 expression microarrays, washed and scanned using Affymetrix instrumentation and protocols.
We consider the probes for which we have annotations. For each tissue type, we filter out array outliers which show an average correlation of with respect to all other arrays.
The mice were genotyped at 140 SNPs that are polymorphic between 129S1/SvImJ and C57BL/6J from the JAX SNP Genotyping Panel [25]. Information on SNPs is listed in Table S1. We use 135 out of the 140 SNPs that are polymorphic in all tissues for our analysis.
Normalization and selection of individuals
In our analysis, we consider the gene expression levels of probes collected in 4 tissues (liver, spleen, cortex and heart) over individuals. To be consistent with the different tissue datasets we analyze, we randomly chose 50 individuals from those datasets that have more than 50 individuals. We first used RMA to perform background adjustment on the raw expressions and then quantile normalization to normalize the adjusted expressions. For 10 tissues, we collect the same number of gene expression levels over individuals.
Power simulation framework
Our power simulation assumes that we collect four tissues from 100 individuals, and considers four scenarios where an eQTL has an effect in (1) one tissue, (2) in two tissues, (3) in three tissues, and (4) in all four tissues. To generate the gene expression level of individuals that considers the repeated measurements from the same individuals, we first sample gene expression from the multivariate normal distribution:
where is a vector of size 400 corresponding to gene expression of 100 individuals in 4 tissues, and where is a 400 by 400 matrix representing correlation between individuals across the tissues. More specifically, if and are the same individual between two tissues, and otherwise. is an identity matrix with size of 400. and are coefficients of the two variance components, and we use the real mouse dataset to obtain realistic values of the two coefficients. We estimate and for every pair between a gene expression and a SNP, and find that on average, and . We use these values for our simulation.
After sampling , we add a SNP effect to for tissues in which an effect exists using the following equation:
where is gene expression of 100 individuals in tissue (), is on tissue (size of 100), and is SNP information of 100 individuals. if an eQTL does not have an effect in tissue , and if an eQTL has an effect. Since the goal is to compare the relative power between methods, we vary the effect size () depending on the scenario to avoid too high or too low power. Specifically, we set for the scenarios (1), (2), (3), (4), respectively.
Linear model for tissuebytissue approach
We assume an additive linear model to represent the relationship between the expression of one gene and one SNP. We can write that relationship in the following way for an arbitrary gene and SNP at tissue :
where is a size vector denoting gene expression levels of individuals, is a size vector denoting SNP, is a vector of ones, and . To assess the significance of an association between a SNP and a gene, we perform a standard Ftest for the null hypothesis and also obtain an estimate of using the lm function in R. In the tissuebytissue approach, if any single tissue turns out to be significant (), the pair of SNP and gene expression are reported as a significant eQTL. TBT can also find tissues in which an eQTL exists by examining which is nonzero.
MetaTissue  linear mixed model
We use a linear mixed model to take into account the fact that eQTL studies collect multiple tissues from the same individuals. This is called a “repeated measures design,” and the mixed model is often used to model the correlation induced by the repeated measurements such as in longitudinal data. Let be the number of tissues, and for simplicity, we assume there are individuals for each tissue, but individuals collected in one tissue do not necessarily completely overlap with those in another tissue; it is possible that some individuals may provide all tissues while others may provide a subset. We also assume that we have SNP information for all individuals. We apply the following linear mixed model to assess the statistical significance between gene expression and SNP :
Here is a description of each variable in above equation. Let .

is an matrix denoting expression levels of individuals in tissues. In other words, the first rows are expression of individuals in the first tissue, the next are expression in the second tissue, and so on. Expression values of each tissue are normalized to .

is an matrix denoting the intercepts for tissues. The first column of denotes the intercept for the first tissue; the first rows are ones, and the next are zeros. In the second column that denotes the intercept for the second tissue, the first rows are zeros, the next rows are ones, and the next rows are zeros.

is a matrix denoting coefficients of intercepts.

is an matrix denoting SNP for tissues. This is similar to the matrix, and we replace ones in the matrix with SNP information. For example, in the first column, the first rows are SNP information of individuals in the first tissue, and the next rows are zeros.

is a matrix denoting coefficients of SNP effects in tissues.

is the random effect of the mixed model due to the repeated measurements of individuals, and where is an matrix representing how individuals are shared across the tissues (discussed in the Power simulation framework section). represents random errors and where is an identity matrix. To efficiently estimate the two variance components ( and ), we use the efficient mixedmodel association (EMMA) package [26].
To estimate and its covariance, we apply the generalized least squares. Let . Then, the estimated is
MetaTissue  metaanalysis
Given the estimate , we combine information from multiple tissues by applying metaanalysis to . If the effect of eQTL is the same for all tissues, applying fixed effects model (FE) metaanalysis will be a powerful approach. If the effects of eQTL differs by tissues, applying random effects model (RE) metaanalysis will be a powerful approach [18].
Fixed effects model
Fixed effects model (FE) is a metaanalysis method that assumes the effect size of a variant is fixed across datasets [12], [13], and its statistic is computed based on the inversevarianceweighted effect size [27]. Let and be the estimates of effectsize and the standard error of , respectively, in tissues. Let be the unknown true effect size. The null hypothesis of FE is ; in other words, effect size in all tissues is zero. A statistic of FE () and its distribution under the null hypothesis are
A pvalue of is obtained from the standard normal distribution.
Random effects model
Our MetaTissue method leverages new random effects model (RE) [18] to detect eQTLs from multiple tissues while taking into account heterogeneity of effect sizes in different tissues. The assumption of the random effects model is that the effect size of a variant is different among datasets and follows a probability distribution with mean and variance . The null hypothesis of the random effects model is equivalent to that of the fixed effects model; that is, . The traditional random effects model, however, assumes a conservative null hypothesis model. The new random effects model corrects this conservative null hypothesis model and outperforms the traditional random effects model. More specifically, a statistic of RE () is defined as
where and are estimated mean and variance of the effect size, and the maximum likelihood estimates of the two parameters are calculated iteratively as following
The initial value of is estimated using approaches in the traditional random effects model [14], [20], [28]. We obtain a pvalue of from pvalue tables that are constructed from numerous null statistics.
Accounting for covariance of effect size estimates
Since we use linear mixed model to account for the fact that multitissue eQTL studies often collect multiple tissues from the same individuals, our estimates of effect size, in Equation (4) can become correlated. The covariance structure is estimated using the standard formula of the generalized least squares,
It is important that the metaanalysis methods account for this covariance structure of effect size estimates.
To take into account the covariance structure in metaanalysis, we use an extension [29] of the Lin and Sullivan approach [30]. Given and their covariance , the optimal fixed effects model metaanalysis statistic is
where is the vector of ones (). The variance of the statistic is given
Note that if is independent ( is a diagonal matrix), and are equivalent to the inversevariance weighted effect size estimate (the numerator of equation (5)) and its variance.
It can be shown that this approach is equivalent to building a new “uncorrelated” variance of ,
and then giving and as input to the traditional metaanalysis approaches assuming independent estimates [29]. This “uncorrelating” idea allows us flexibility to use the correlated estimates in any metaanalysis framework requiring independent estimates. We use and its “uncorrelated” variance for the fixed effects model (which gives equivalent results to the Lin and Sullivan approach [30]), random effects model, heterogeneity estimation ( and ), and the mvalue estimation [19].
Predicting effects of eQTLs in multiple tissues
To predict whether an eQTL has effects in a specific tissue, MetaTissue computes a statistic called the “mvalue” proposed by Han and Eskin [19] that specifies the posterior probability that an effect exists in a tissue. First, we denote as a vector of ; . Let be a random variable whose value is 1 if dataset has an effect and 0 otherwise. We also denote as a vector of , and since each has two values, has possible values. Let be one of those values, and let denote a vector of . To estimate the mvalue , we need to compute the probability, , which is the probability of dataset having effects given the observed effect sizes. We can compute this probability using the Bayes' theorem
where is a set of in which th value is 1. The equation shows that we need to compute and terms for every to compute . We can compute as
where denotes the number of 1's in and denotes the beta function. and are set to one [19]. The probability of given , , is computed as
where
denotes the probability density function of the normal distribution with mean equal to and variance equal to , and and denote the indices of 0 and 1 in , respectively. is the inverse variance, and is the prior for the effect size; when an effect is small while when an effect is large for binary traits [31], [32]. For quantitative traits, there is no general guidelines for the normally distributed priors, so we choose to use the default value . is a scaling factor defined as
More detailed derivations of and terms are discussed in Han and Eskin [19].
Practical issues in combining mixed model and metaanalysis
There are subtle issues in our framework combining mixed model and metaanalysis. First, the effect size estimates from linear model or mixed model are typically distributed, while most of metaanalysis methods assume normally distributed effect sizes. Second, our approach simultaneously considers all tissues using Equation (3), but the error model is slightly different from the tissuebytissue approach in Equation (2). In the tissuebytissue approach, the error is fit in each tissue separately, while in our new approach, the error is fit in all tissues together, which is often less powerful than the former. We correct for these subtle differences using simple heuristics (See Text S2).
Ethics statement
This study was performed in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health.
Supporting Information
Zdroje
1. BremRB, YvertG, ClintonR, KruglyakL (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752–5.
2. BremRB, KruglyakL (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci U S A 102: 1572–7.
3. KeurentjesJJB, FuJ, TerpstraIR, GarciaJM, van den AckervekenG, et al. (2007) Regulatory network construction in arabidopsis by using genomewide gene expression quantitative trait loci. Proc Natl Acad Sci U S A 104: 1708–13.
4. CheslerEJ, LuL, ShouS, QuY, GuJ, et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet 37: 233–42.
5. BystrykhL, WeersingE, DontjeB, SuttonS, PletcherMT, et al. (2005) Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat Genet 37: 225–32.
6. CheungVG, SpielmanRS, EwensKG, WeberTM, MorleyM, et al. (2005) Mapping determinants of human gene expression by regional and genomewide association. Nature 437: 1365–9.
7. StrangerBE, NicaAC, ForrestMS, DimasA, BirdCP, et al. (2007) Population genomics of human gene expression. Nat Genet 39: 1217–24.
8. EmilssonV, ThorleifssonG, ZhangB, LeonardsonAS, ZinkF, et al. (2008) Genetics of gene expression and its effect on disease. Nature 452: 423–8.
9. SpielmanRS, BastoneLA, BurdickJT, MorleyM, EwensWJ, et al. (2007) Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet 39: 226–31.
10. FuJ, WolfsMGM, DeelenP, WestraHJJ, FehrmannRSN, et al. (2012) Unraveling the regulatory mechanisms underlying tissuedependent genetic variation of gene expression. PLoS Genet 8: e1002431.
11. de BakkerPIW, FerreiraMAR, JiaX, NealeBM, RaychaudhuriS, et al. (2008) Practical aspects of imputationdriven metaanalysis of genomewide association studies. Hum Mol Genet 17: R122–8.
12. COCHRANWG (1954) The combination of estimates from different experiments. BIOMETRICS 10: 101–129.
13. MANTELN, HAENSZELW (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 22: 719–48.
14. DerSimonianR, LairdN (1986) Metaanalysis in clinical trials. Control Clin Trials 7: 177–88.
15. IoannidisJPA, PatsopoulosNA, EvangelouE (2007) Heterogeneity in metaanalyses of genomewide association investigations. PLoS One 2: e841.
16. IoannidisJPA, PatsopoulosNA, EvangelouE (2007) Uncertainty in heterogeneity estimates in metaanalyses. BMJ 335: 914–6.
17. EvangelouE, MaraganoreDM, IoannidisJPA (2007) Metaanalysis in genomewide association datasets: strategies and application in parkinson disease. PLoS One 2: e196.
18. HanB, EskinE (2011) Randomeffects model aimed at discovering associations in metaanalysis of genomewide association studies. Am J Hum Genet 88: 586–98.
19. HanB, EskinE (2012) Interpreting metaanalyses of genomewide association studies. PLoS Genet 8: e1002555.
20. HigginsJ, ThompsonSG (2002) Quantifying heterogeneity in a metaanalysis. Statistics in medicine 21: 1539–1558.
21. StoreyJD (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64: 479–498.
22. PetrettoE, BottoloL, LangleySR, HeinigM, McDermottRoeC, et al. (2010) New insights into the genetic control of gene expression using a bayesian multitissue approach. PLoS Comput Biol 6: e1000737.
23. KimS, XingEP (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet 5: e1000587.
24. HooperM, HardyK, HandysideA, HunterS, MonkM (1987) Hprtdeficient (leschnyhan) mouse embryos derived from germline colonization by cultured cells. Nature 326: 292–5.
25. PetkovPM, DingY, CassellMA, ZhangW, WagnerG, et al. (2004) An efficient snp system for mouse genome scanning and elucidating strain relationships. Genome Res 14: 1806–11.
26. KangHM, ZaitlenNA, WadeCM, KirbyA, HeckermanD, et al. (2008) Efficient control of population structure in model organism association mapping. Genetics 178: 1709–23.
27. FleissJL (1993) The statistical basis of metaanalysis. Stat Methods Med Res 2: 121–45.
28. HARDYRJ, THOMPSONSG (1996) A likelihood approach to metaanalysis with random effects. Statistics in Medicine 15: 619–629.
29. HanB, SulJH, EskinE, de BakkerPIW, RaychaudhuriS (2013) A general framework for metaanalyzing dependent studies with overlapping subjects in association mapping. URL http://arxiv.org/abs/1304.8045.
30. LinDYY, SullivanPF (2009) Metaanalysis of genomewide association studies with overlapping subjects. Am J Hum Genet 85: 862–72.
31. StephensM, BaldingDJ (2009) Bayesian statistical methods for genetic association studies. Nat Rev Genet 10: 681–90.
32. MarchiniJ, HowieB, MyersS, McVeanG, DonnellyP (2007) A new multipoint method for genomewide association studies by imputation of genotypes. Nat Genet 39: 906–13.
Štítky
Genetika Reprodukční medicínaČlánek vyšel v časopise
PLOS Genetics
2013 Číslo 6
Nejčtenější v tomto čísle
Tomuto tématu se dále věnují…
 BMS1 Is Mutated in Aplasia Cutis Congenita
 Distinctive Expansion of Potential Virulence Genes in the Genome of the Oomycete Fish Pathogen
 Sexstratified Genomewide Association Studies Including 270,000 Individuals Show Sexual Dimorphism in Genetic Loci for Anthropometric Traits
 Environmental Dependence of Genetic Constraint