Integrated Pathway-Based Approach Identifies Association between Genomic Regions at CTCF and CACNB2 and Schizophrenia

Download PDF České info

Large-scale genetic studies of complex diseases such as schizophrenia have identified a variety of susceptibility loci. Since many of the respective variants have only a weak influence on disease risk, pathophysiological interpretation of the results is problematic. Investigation of the joint effects of multiple functionally related genes or pathways increases the power to detect disease related genes, and provides insights into the etiology of the disease in question. In the present study, an integrated hierarchical approach was applied to: (i) identify pathways associated with complex neuropsychiatric disease schizophrenia (ii) detect potentially affected genes in these pathways; and (iii) annotate the functional consequences of genetic markers in the affected genes or their regulatory regions. Two samples comprising >10,000 individuals of European ancestry as well as data from the Psychiatric Genomics Consortium schizophrenia study were examined. Pathways representing transcriptional regulation and gene expression, cell adhesion, apoptosis, and synapse organization showed significant association with schizophrenia. In particular, CTCF, CACNB2, and ARL5B, i.e. genes involved in chromatin modulation, calcium channel signaling and membrane transport, respectively, were highlighted as candidate genes for schizophrenia risk.

Published in the journal: . PLoS Genet 10(6): e32767. doi:10.1371/journal.pgen.1004345
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1004345

Summary

Introduction

Genome-wide association studies (GWAS) have identified common susceptibility variants for numerous disorders [1], [2]. For complex diseases, however, many of the discovered variants have only a moderate or weak effect on disease risk. Due to correction for multiple testing and limited sample sizes, GWAS are likely to miss a fraction of loci with small genetic effect sizes, and researchers assume that a major fraction of heritability remains hidden for statistical reasons [3]. One way of overcoming this problem is to investigate the joint effects of multiple functionally related genes (e.g. gene-sets or pathways). Pathway-based analysis of GWAS data increases the power to detect disease related genes and, potentially, single nucleotide polymorphisms (SNPs) with small genetic effects. This approach provides valuable biological insights into the etiology of complex diseases [4].

At the time of writing, several methods are in use for the pathway-based analysis of GWAS data [5], [6], and pathway association studies have identified novel candidate genes and pathways for a range of neuropsychiatric disorders [5], [7]–[12].

Various methodological approaches to pathway association analysis are available. Maciejewski [13] has described a classification for gene-set analysis that is based upon both the statistical model used and the nature of the underlying hypothesis. This classification comprises four groups: self-contained, competitive with sample randomization, competitive with gene randomization, and parametric. The main advantages of the self-contained and the competitive with sample randomization tests are twofold. Firstly, they resemble the underlying biological experiment. Secondly, the results are amenable to statistical interpretation [14], [13].

While selection of the pathway association method is an important consideration, the power of a given pathway association study is also dependent upon other factors. These include the biological information (i.e. from gene-set and pathway databases) that is integrated into the model, the use of independent replication datasets, and the different levels of interpretation, which extend from the pathway level to the level of SNPs.

As a logical consequence, researchers are now modifying analytical frameworks in order to increase their power and potential impact. To achieve this, the present study has applied a hierarchical approach (see Figure 1). This approach uses three levels of evidence to unravel novel biological mechanisms with potential involvement in complex disorders. An advantage of this approach is that it builds upon previously developed and proven tools which gain synergistic effects from intersecting three different levels of evidence, i.e. evidence from the pathway-, gene-, and SNP-level. To test disease associated gene-sets and pathways, the Global Test was applied [15], [16]. To date, this well-established, self-contained pathway test has mainly been used for gene expression analyses. Subsequent identification of important risk-genes within the significant pathways was achieved using FORGE [17], while detection of the functional consequences of associated SNPs, i.e. the SNP function annotation, in the significantly associated genes was performed using RegulomeDB [18]. As part of our approach, a well-curated list of pathways and gene-set collections was integrated, and a reduction in false-positive findings was sought through the use of large-scale exploratory and independent replication samples. We applied our approach to data sets for schizophrenia (SCZ), and provide evidence for new SCZ risk genes that would otherwise have remained undetected in the investigated study samples.

Fig. 1. Flowchart for (1) detection and replication of schizophrenia associated pathways and (2) identification of the most informative genes, and (3) functional annotation of single nucleotide polymorphisms in the genes of interest.

Results

Pathway analyses

Application of the Global Test to the BOMA-UTR (MooDS SCZ consortium (BOMA)) dataset and independent data from a Dutch study (UTR), Table 1) yielded 27 pathways that were significantly associated with SCZ after correction for multiple testing (False Discovery Rate (FDR)<0.05) (Table S1A). Of these, 14 pathways remained significant in the replication dataset. The replicated pathways are listed in Table 2, together with their FDRs, nominal p-values, and SNP set sizes. The replicated pathways include the following: (i) six gene-sets from the Transcription factor Targets database (dbTFT); (ii) four Gene Ontology (GO) terms (zinc ion binding, transition metal ion binding, positive regulation of gene expression, and synapse organization); (iii) two Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (cell adhesion molecules, and apoptosis); (iv) one gene-set from the Chemical and Genomic Perturbation database (dbCGP, Kyng DNA damage by UV); and (v) one gene-set from the microRNA targets database (mir-484 targets). The gene overlap for each pathway pair is shown in Figure S1. Table S2 summarizes the redundancy estimates for pathways retrieved from the same source. A description and a visual depiction of pathways with similar SNP content in the BOMA-UTR dataset are provided in Text S1 (section “Pathway overlap”) and Figure S2, respectively. The overall gene and SNP overlap between all pairs of replicated pathways are provided in Table S3A and Table S3C, respectively. For the GAIN-MGS dataset, the gene and SNP overlap information is provided in Table S3B and Table S3D, respectively. The section “Subject vs SNP label permutations” in Text S1 and Figure S3 provides a detailed description of the results of the SNP-label permutation test coupled with the subject-sampling test.

**Tab. 1. Description of individual samples.**

**Tab. 2. Comparisons of FDRs (BH) and p-values (P) for the BOMA-UTR and the GAIN-MGS data sets for the replicated pathways.**

To visualize the integration of the Global Test application on a SNP-, a gene -⁠ and a pathway level, Circos plots were generated for the entire genome (Figure 2). These plots illustrate the impact of those individual SNPs that were annotated to the replicated pathways (whether overlapping or unique to a specific pathway) and the associated genes.

**Fig. 2. (A) Circos plots integrating the Global Test and FORGE analysis and heatmaps for the levels of single nucleotide polymorphism (SNP)- and gene significance.**

Gene-based analysis

A total of 100 genes fulfilled the criteria described in the Methods section “Gene-based analysis with Global Test and FORGE”, i.e. these genes map to SNPs with a component Global Test p-value of <0.001 in the BOMA-UTR dataset. Of these, the following eight genes were annotated to at least four (up to eight) of the 14 replicated pathways, thus indicating their potential importance in terms of SCZ risk: FOXP2 (eight pathways); BCL11A (six pathways); PCDH7 and RPL36P13 (five pathways respectively); and CACNB2, CTCF, MECOM, and RIMS1 (four pathways respectively).

Of the genes that were annotated to the 14 replicated pathways, the top 100 were then tested in the Psychiatric Genomewide Association Study Consortium (PGC) data. Of these, significant results were obtained for 18 genes (see Table S4). The vast majority of the 18 genes reside on different chromosomes, while most of the remainder reside on different chromosome arms. It therefore seems reasonable to assume that they represent independent signals, which results in a p-value of 0.004 for an enrichment of SCZ-associated genes among the 100 top genes. Included in the list of 18 replicated genes are known SCZ susceptibility genes, such as NRXN1, GRM3, and MMP16. Two of the eight most frequent genes in the top 14 pathways were also among the nominally significant genes in the gene-based FORGE analysis, i.e. CACNB2 (p = 8.57×10⁻⁴) and CTCF (p = 0.015). Given the overlap (approx. 1,200 cases) between the PGC sample (FORGE analyses) and the present discovery sample (component Global Test), we opted to analyze the PGC dataset without including our discovery dataset. These analyses generated results of the same order of magnitude for both genes (CACNB2: p = 0.0090; CTCF: p = 0.0320). While CACNB2 showed a trend towards association in an independent dataset from Denmark (p = 0.0970), thus supporting the strong signal from the PGC data, CTCF was found to be strongly associated in the same independent Danish sample (p = 0.0075).

Potential functional consequences of SNPs in CTCF

Polyphen-2 predicted that the coding SNPs of interest in CTCF were “benign”, whereas SIFT predicted that they were “tolerated” (Table S5). Figure 3 illustrates the potential consequences predicted for SNPs in CTCF and its regulatory regions. These include SNPs genotyped in the present discovery study and SNPs identified as their proxies using SNAP. For the latter, only those that were annotated by RegulomeDB as being (1) likely to affect DNA binding of the protein and linked to expression of a gene target, or (2) likely to affect DNA binding, are listed. The complete functional annotation data for the SNPs of CTCF are provided in Table S5. All genotyped SNPs annotated to CTCF showed a significant (component Global Test p-value of ≤0.05) contribution to pathway associations. Of these, rs6499137 and rs7191281 were located at the 3′-UTR and the intron of CTCF, respectively. Given the 20 kb flanking region allowed for assigning the SNPs to a gene, the other two SNPs were considered to be shared with the neighboring gene RLTPR. Based on the functional annotation with the RegulomeDB database, the 3′-UTR SNP of CTCF (rs6499137) and its proxies were considered to be associated with the altered expression of the neighboring gene RLTPR (Figure 3, Table S5). One of the proxies (rs17686899) overlaps with a number of functional elements, such as open chromatin region, the binding sites for different transcription factors, and regions with certain histone modifications across many cell types. This suggests that the SNP was likely to affect the binding of a number of transcription factors to the genomic region of this gene. The respective expression quantitative trait loci (eQTL) information suggested that the SNP was likely to affect the expression of two genes, i.e. DUS2L and RLTPR. Among the CTCF-annotated SNPs, the intronic SNP of CTCF, rs7191281, was one of the top SNPs (component Global Test p-value of <0.001) contributing to the association of CTCF (and the association of the four replicated pathways containing CTCF). In addition, this SNP had the lowest p-value in the analyses of the PGC SCZ sample. While no information concerning functionality was available in the RegulomeDB database for this intronic SNP of CTCF, its proxy, rs13334205, was annotated with strong functional consequences. This proxy SNP was located in the regulatory region of CTCF and overlapped with the binding site of DNA-binding proteins, such as EBF1, TCF12, POLR2A, in an open chromatin region (Figure 3, Table S5).

**Fig. 3. RegulomeDB functional annotation for SNPs in CTCF and its regulatory regions.**

Potential functional consequences of SNPs in CACNB2

The complete functional annotation data for the SNPs of CACNB2 are provided in Table S6. The positions of the majority of the genotyped and the proxy SNPs of CACNB2 overlapped a motif match to the FOX (FOXP1, FOXJ1, FOXJ2) and GATA (GATA1, GATA3) family motifs in open chromatin regions. Among the SNPs mapped to CACNB2, rs12257556 and its proxy rs4748474 were annotated with the strongest functional consequences. These intronic SNPs were eQTLs for ARL5B, and overlapped an open chromatin region. The proxy SNPs rs35803482 and rs7897710 both overlap with the binding sites of RAD21, SMC3, CTCF, and have a motif match for FOXP1. The intronic SNP rs2799573 (which was also the most highly associated SNP of CACNB2 in the PGC data) lies in the binding region of a number of proteins, such as CDX2, CTCF, JUN, JUND, MEF2A, RAD21, and SMC3 (Table S6), as identified in the ENCODE ChIP-seq data across a diverse set of cell types.

Discussion

SCZ GWAS data analyses

In the present study, a genome-wide pathway association analysis was performed by means of the Global Test. The analyses involved well-curated descriptions of 7,350 pathways, and were carried out on large-scale discovery and replication datasets. A gene-based analysis of genes with a high contribution to the significance of the top pathways was then performed using the SCZ GWAS results of the PGC. Finally, a functional SNP-based analysis of the top hit genomic regions was conducted. Through this hierarchical approach, we were able to replicate pathway findings from previous studies of SCZ and detect novel pathways and genomic regions with an association to SCZ in the investigated samples. In the discovery set, we detected evidence for a significant contribution of 27 pathways. Of these, 14 remained significant in the replication dataset. The 14 replicated pathways are involved in transcriptional regulation and gene expression, synapse organization, cell adhesion, and apoptosis.

Previous pathway analyses of SCZ GWAS data have identified associations with pathways that are mainly involved in processes critical to synaptic function, neurodevelopment, cell adhesion, the immune system, the estrogen biosynthetic process, and apoptosis [10], [19], [20]. One of the 14 significant pathways in the present study, i.e. cell adhesion, was also the most significant pathway in the study by O'Dushlaine et al. [10]. Jia et al. [19] reported nominal significance for the following four pathways: CARM_ER (CARM1 and Regulation of the Estrogen Receptor); glutamate metabolism; TNFR1; and TGF beta signaling. Glutamate is implicated in synaptic neurotransmission, and TGF-beta and TNFR1 signaling are involved in several cellular processes, including apoptosis and excitotoxicity. The top hit pathways “synaptic organization” and “apoptosis” from the present study are thus consistent with the results of Jia et al [19].

However, the majority of pathways with significant association to SCZ in the present study are novel, and they are mainly involved in transcriptional regulation and gene expression. One reason for the failure of previous pathway-based studies of SCZ to generate similar findings may have been that they focused mainly on gene sets from the KEGG and BioCarta databases, whereas we accessed several pathway databases. These included the GO database, as well as special gene-set collections on chemical and genomic perturbations (dbCGP), and transcriptional regulation such as dbTFT and dbMIR. It should be noted that only few of our 14 replicated pathways achieved significance in the analysis of our discovery sample using GRASS [21], gseaSNP [22], and ALIGATOR [23]; see Text S1 and Table S1C). The difference in results can be explained by the different assumptions these alternative pathway approaches rest on.

As part of our hierarchical approach, we aimed to identify which genes in a particular pathway could be responsible for the association with SCZ risk. Integration of gene-based analysis facilitated both the prioritization of potential candidate genes and more precise formulation of hypotheses concerning the functional consequences of the potential pathway perturbations (i.e. at the gene -⁠ and SNP-level). In particular, we explored how variants that emerged as being of importance for our pathway -⁠ and gene-based signals might affect the function and regulation of other genes.

In the gene-based analysis, CACNB2 and CTCF showed the strongest evidence for association with SCZ in both the present samples and in those of the PGC. The gene CACNB2 encodes an auxiliary voltage-dependent L-type calcium-channel subunit that is mainly expressed in heart and brain tissue [24]. This subunit is essential for normal surface expression, adequate trafficking, and functioning of voltage-gated calcium channels [24]. Recently, CACNB2 was among four loci with genome-wide significance in a cross-disorder analysis of GWAS data for autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and SCZ [25]. Previously, CACNB2 had been one of the top hit regions in a GWAS of bipolar disorder I in a Han Chinese population [26]. Functionally, the calcium channel beta-2 subunit encoded by CACNB2, together with the calcium channel alpha(2)/delta subunit, affects the kinetics and expression of Ca(V)1.2 (encoded by CACNA1C) [27]. CACNA1C is a well-established susceptibility gene for bipolar disorder, SCZ, and major depressive disorder [25], [28]–[31]. The RegulomeDB search of genotyped SNPs and their proxies in CACNB2 resulted in the detection of the intronic SNPs rs12257556 and rs10764566, and these were eQTLs for ARL5B. The gene ARL5B encodes a trans-Golgi network localized small G protein that has been described as a key regulator of retrograde membrane transport [32]. Altered ARL5B expression may be involved in the dysregulation of axonal transport. Interestingly, a previous study found that the transcript of one of the most widely studied susceptibility genes for SCZ, DISC1, was an interacting molecule for a motor protein of axonal transport [33]. It is of note that SNPs (both genotyped and proxies) at the CACNB2 locus suggested an interplay with our second gene of interest, i.e. CTCF. Such a connection is also suggested with RAD21. A substantial body of literature describes an interaction between RAD21 and CTCF, particularly in neurons [34], [35]. Although few data are available on a potential interaction between CACNB2 and RAD21/CTCF, moderate evidence is available from several protein-protein interaction databases (data not shown) for an interplay between CTCF, RAD21, and ARL5B.

CTCF encodes a transcriptional regulator protein with 11 conserved zinc finger domains, and is an important modulator of conformational changes in chromatin [36]. A recent study of conditional knockout of the ctcf gene in mice demonstrated that CTCF was a key regulator of neuronal differentiation, and was essential for neuronal diversity and functional neural networks [37]. The authors showed that CTCF was required for appropriate dendritic arborization and synapse formation, since it controlled clustered protocadherin expression. Previous studies have shown an association between genetic variation in the protocadherin gene cluster and SCZ [38], [39]. Our result adds to this body of research the finding that transcriptional regulation of genes essential for neuronal diversity, such as the regulation of protocadherins by CTCF, may alter synaptic connectivity and thus contribute to the etiology of SCZ. Intriguingly, evidence from the majority of CTCF SNPs (both genotyped and proxies) suggested that the variants influence RLTPR expression (Figure 3). The RLTPR gene is expressed in several brain regions (EMBL-EBI Expression Atlas; http://www.ebi.ac.uk/gxa/gene/ENSG00000159753). The resulting protein has a RGD (Arginine-Glycine-Aspartic acid) motif [40]. This is a universal cell recognition site of extracellular proteins and interacts with a family of cell-surface receptors, such as integrins for cell-adhesion molecules [41]. Together with the replicated KEGG pathway cell adhesion molecules, this finding strongly supports the hypothesis that modulation of adhesion, and interactions between cells as well as cell and the extracellular matrix, are implicated in the etiology of SCZ.

Another top hit gene in the present study was FOXP2, which was among the top genes in eight of the 14 most implicated pathways. FOXP2 (forkhead-box P2) is a transcription factor with an essential role in the development of speech and language regions in the brain. The fact that SCZ patients often show language impairments such as reading difficulties [42] renders FOXP2 a plausible SCZ candidate gene. Interestingly, a previous study reported an association between genetic variation in FOXP2 and SCZ in a Han Chinese population [43]. Furthermore, Walker et al. [44] identified FOXP2 as an inhibitor of the promoter activity and protein expression of DISC1. The present study supports the hypothesis that FOXP2 plays an important role in SCZ on the level of the transcriptional regulation of target genes.

The association with the apoptosis pathway was driven predominantly by a SNP which mapped to AKT3. Besides being detected via the Global Test, this gene was the most significantly associated gene in the FORGE analysis of the PGC data. AKT3 is a serin/threonine protein kinase, and is a member of the AKT family. It is involved in many biological processes, including apoptosis and cellular proliferation [45]. In a recent study by Diez et al. [46], AKT3 was identified as a modulator of the fine regulation of apoptotic processes and axon growth. Disruption of AKT3 significantly reduced axon length and viability of neurons in cell culture [46]. Moreover, AKT3 is the most abundant AKT member in the brain during neurogenesis. AKT3 controls brain size, and research has shown that genetic variation (duplication and point mutation) of AKT3 contributes to hemimegalencephaly [47].

In conclusion, the present study demonstrated that use of information from databases focusing on cell-regulatory networks together with information from traditional pathway database resources can facilitate the identification of susceptibility factors for the complex neuropsychiatric disease SCZ. Through the application of a well-designed hierarchical framework, our study highlighted the importance of calcium channel signaling, cell adhesion, and the modulation of transcriptional regulation implicated in neuronal diversity, neurite growth, and synapse formation in the etiology of SCZ. In particular, CTCF and CACNB2 (and possibly ARL5B) were identified as SCZ candidate genes.

Materials and Methods

Ethics statement

Each participant provided written informed consent prior to inclusion and all aspects of the study complied with the Declaration of Helsinki. The study was approved by the ethics committees of all study centers. For the German samples, this comprised the Ethics Committee of the Rheinische Friedrich-Wilhelms-University Medical School in Bonn, Ethics Committee “Medizinische Ethik-Kommission II” of the University of Heidelberg, the Ethics Committee of the Friedrich-Schiller-University Medical School in Jena, and the Ethics Committee of the Ludwig-Maximilians-University Munich. Samples obtained through dbGaP were collected using institutional review board-approved protocols in three studies, i.e. Schizophrenia Genetics Initiative (SGI), Molecular Genetics of Schizophrenia Part 1 (MGS1), and MGS2.

Data sets

Participants from four datasets were included (Table 1). The discovery set was the BOMA-UTR sample. This consisted of data from the MooDS SCZ consortium (BOMA) [48], [49], and independent data from a Dutch study (UTR) [48], and comprised 2,230 SCZ cases and 2,810 controls. The replication set consisted of the GAIN [dbGaP accession number: phs000021.v2.p1], and the MGS [dbGaP accession number: phs000167.v1.p1] datasets, and comprised 2,436 SCZ cases and 2,646 controls [50]. The BOMA and MGS samples were also used in the PGC SCZ study. An overlap of 80% existed between the PGC study and the sample used in the present pathway-based analysis.

Linkage disequilibrium (LD)-based SNP pruning

To accommodate the Global Test's assumption of independence between variables, the SNP set was reduced according to a variance inflation factor (VIF) and using a sliding window approach, as implemented in PLINK [51] (http://pngu.mgh.harvard.edu/purcell/plink/, version 1.07). A VIF of 100 was used. The window size was set at 50 SNPs, and was shifted by 5 SNPs at each step. An LD-based pruned set of SNPs (Table 1) was then considered for mapping to pathways. A detailed description of this procedure is provided in Text S1 (section “SNP independence and LD-based SNP pruning”) and in Table S7.

PGC sample

For the gene-based analysis, PGC data (https://pgc.unc.edu/ResultFiles/pgc.scz.2012-04.zip) were used.

Annotation of SNPs to genes

SNPs were annotated with information from dbSNP Build 127. The “seq-gene” file containing information for annotating the SNP rs numbers to ENTREZ gene IDs was downloaded from the NCBI ftp website (BUILD 36.3). SNPs were assigned to a gene if the SNP was located within the genomic sequence or within 20 kb of the 5′ and 3′ ends of the first and last exons in order to account for important regulatory regions [52]. If a SNP was within a region shared by more than one gene, it was assigned to all genes (for details see Text S1).

Pathway and gene-set databases

Selected gene-set collections were accessed from the Molecular Signatures Database (MSigDB, version 3.0) [53] website (http://www.broadinstitute.org/gsea/msigdb). This included the pathways from BioCarta (217 pathways), Chemical and Genomic Perturbations (1,825 gene-sets), Reactome (775 pathways), MicroRNA Targets (176 gene-sets), and Transcription Factor Targets (456 gene-sets). Information concerning GO terms [54] and KEGG pathways [55], [56] was obtained from the respective R packages (3,686 GO terms; GO.db, version 2.5.0; 215 KEGG pathways; R package KEGG.db, version 2.5.0). At the time of data retrieval (June, 2011), these repositories were more up-to-date than the MSigDB database. A total of 7,350 pathways were included. These were represented by 237 788 (53.7%) of the SNPs in the BOMA-UTR dataset. Hence 53.7% of SNPs genotyped in the exploration samples were mapped to pathways. For the SNP data, SNP effect was coded as an allele dose effect (0, 1, 2). Detailed information on the pathway information overlap and redundancy is provided in Text S1 (section “Choice of pathways and gene-sets”) and in Table S2 and Figure S1.

Pathway analysis with the Global Test

For the pathway-based analysis, the Global Test [15] was used (R package globaltest, version 5.12.0; Figure 1). The Global Test takes the individual level GWAS data as an input, and tests whether the global polymorphism pattern of a group of genes is significantly associated with the phenotype of interest. To account for both a potential underlying correlation structure and pathway and/or gene size, the Global Test with subject sampling was applied on the basis of 10,000 permutations of case-control status [15]. To study the impact of pathway and/or gene size in more detail, a SNP label permutation test was performed (for detailed information see Text S1, section “Subject vs SNP label permutations”).

At the discovery stage of the analysis, less conservative correction for multiple testing was applied in order to prioritize the identification of associated pathways. This was a legitimate approach, since any false positives would be controlled for in the replication analysis. Multiplicity correction was applied for each individual collection of pathways/gene-sets. For pathways/gene-sets retrieved from the KEGG, Reactome, and MSigDB gene-set collections, the pathway scores were corrected for multiple testing using the Benjamini-Hochberg method [57]. A pathway was considered to be significantly associated with the phenotype of interest (i.e. SCZ) if the false discovery rates from all three of the following were <0.05: (i) un-permuted test; (ii) the subject-sampling test; and (iii) the SNP-label permutation tests. The resulting list of significant pathways was ranked according to the false discovery rate obtained from the SNP-label permutation tests. For the GO terms, correction for multiple testing was performed using the Focus Level method [58]. A GO term was considered to be significant if both of the following were <0.05: (i) the focus level obtained from the un-permuted test; and (ii) the false discovery rate obtained from the subject-sampling test. To account for a gender-specific variance in the perturbed pathways, control for gender was used as a covariate [15].

Component Global Test

To estimate the contributions of individual SNPs to a pathway -⁠ or a gene association, the component global test was performed using the covariates function implemented in the R package globaltest [15]. Throughout the text, the single SNP p-values obtained using the Global Test refer to the results obtained using the component global test.

The Global Test with the replication dataset

Only pathways that were significantly associated with SCZ in the discovery set were followed-up (Figure 1, step 1). All tests in the follow-up step were performed as described above, with the exception that all tested pathways were subjected to Benjamini-Hochberg correction for multiple testing. Possible stratification in the data was investigated using a multi-dimensional scaling (MDS) approach. MDS covariates were obtained from PLINK using a previously described protocol [48]. To correct for the potential effect of stratification on the association test, the Global Test was run with four leading MDS dimensions as covariates.

Gene-based analysis with Global Test and FORGE

The aim of the second step (Figure 1, gene-based analysis) was to identify genes of particular importance to the replicated pathways. Genes that mapped to one or more of the identified pathways were analyzed (Figure 1, step 2). First, the component global test was performed for every individual SNP that was annotated to the replicated pathways. SNPs with a component global test p-value of <0.001 in the BOMA-UTR dataset were then annotated to genes. These genes are referred to as “top genes” in the subsequent text. Gene-based analysis of PGC data for the top genes was then conducted using FORGE [17] As with the Global Test, the analyses focused on genomic sequences that included both the genes themselves and a 20 kb window on either side of the respective gene to account for important regulatory regions. Along with the summary statistics of the PGC, genotype data from the European HapMap 3 samples were used (CEU and TSI). Details of the program and the test statistic used to calculate the gene-based p-values (fixed-effects Z score method) are provided elsewhere [17]. Genes that remained nominally significant (p<0.05) in both the component global test and the FORGE analyses were considered for the third step of the analyses (SNP function annotation). No correction for multiple testing was performed. However, replication of our most interesting findings was sought in an independent dataset from Denmark. Detailed information on these Danish samples is provided elsewhere [59].

SNP function annotation

The third step (Figure 1, SNP function annotation) focused on genes identified in step 2. Evidence that SNPs annotated to these genes are implicated in SCZ was sought by investigating the potential consequences of SNPs in terms of gene regulation or function. For each gene of interest, we first selected all SNPs that were annotated to this gene and which had shown evidence for association with SCZ in the discovery dataset (Global Test, p≤0.05). To account for the relevant information from other correlated SNPs, we then identified all SNPs from the 1000 genomes project (pilot project) [60] that showed strong LD with the associated SNPs (r²>0.8, maximum distance between both SNPs = 500 kb). The webtool SNAP [61] (Version 2.2) was used. Each query SNP was included as its own proxy. RegulomeDB [18] and Polyphen-2/SIFT [62], [63] were used for the functional classification of non-coding and coding SNPs, respectively.

Supporting Information

Zdroje

1. ManolioTA, BrooksLD, CollinsFS (2008) A HapMap harvest of insights into the genetics of common disease. J Clin Invest 118 : 1590–1605.

2. HindorffLA, SethupathyP, JunkinsHA, RamosEM, MehtaJP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106 : 9362–9367.

3. JiaP, WangL, MeltzerHY, ZhaoZ (2010) Pathway-based analysis of GWAS datasets: effective but caution required. Int J Neuropsychopharmacol 14 : 567–572.

4. HeroldC, MattheisenM, LacourA, VaitsiakhovichT, AngischM, et al. (2009) Integrated genome-wide pathway association analysis with INTERSNP. Hum Hered 73 : 63–72.

5. WangK, LiM, HakonarsonH (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11 : 843–854.

6. RamananVK, ShenL, MooreJH, SaykinAJ (2012) Pathways analysis of genomic data: concepts, methods, and prospoects for future development. Trends in Genetics 28 : 323–332.

7. TorkamaniA, TopolEJ, SchorkNJ (2008) Pathway analysis of seven common diseases assessed by genome-wide association. Genomics 92 : 265–272.

8. AsklandK, ReadC, MooreJ (2009) Pathways-based analysis of whole-genome association study data in bipolar disorder reveal genes mediating ion channel activity and synaptic neurotransmission. Hum Genet 125 : 63–79.

9. HolmansP, GreenEK, PahwaJS, FerreiraMA, PurcellSM, et al. (2009) Gene ontology analysis of GWA study datasets provides insights into the biology of bipolar disorder. Am J Hum Genet 85 : 13–24.

10. O'DushlaineC, KennyE, HeronE, DonohoeG, GillM, et al. (2010) Molecular pathways involved in neuronal cell adhesion and membrane scaffolding contribute to schizophrenia and bipolar disorder susceptibility. Mol Psychiatry 16 : 286–292.

11. WengL, MacciardiF, SubramanianA, GuffantiG, PotkinSG, et al. (2010) SNP-based pathway enrichment analysis for genome-wide association studies. BMC Bioinformatics 12 : 99 doi:10.1186/1471-2105-12-99

12. JiaP, WangL, FanousAH, ChenX, KendlerKS, et al. (2012) A bias-reducing pathway enrichment analysis of genome-wide association data confirmed association of the MHC region with schizophrenia. J Med Genet 49 : 96–103.

13. MaciejewskiH (2013) Gene set analysis methods: statistical models and methodological differences. Brief Bioinform doi: 10.1093/bib/bbt002

14. GoemanJJ, BühlmannP (2004) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23 : 980–987.

15. GoemanJJ, van de GeerSA, de KortF, van HouwelingenHC (2004) A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 : 93–99.

16. DeelenJ, UhHW, MonajemiR, van HeemstD, ThijssenPE, et al. (2013) Gene set analysis of GWAS data for human longevity highlights the relevance of the insulin/IGF-1 signaling and telomere maintenance pathways. Age 35 : 235–249.

17. PedrosoI, BreenG (2011) Gene Set Analysis and Network Analysis for Genome-Wide Association Studies. Cold Spring Harb Protoc doi:10.1101/pdb.top065581

18. BoyleA, HongE, HariharanM, ChengY, SchaubM, et al. (2012) Annotation of functional variation in personal genomes using RegulomeDB. Genome Res 22 : 1790–1797.

19. JiaP, WangL, MeltzerHY, ZhaoZ (2010) Common variants conferring risk of schizophrenia: a pathway analysis of GWAS data. Schizophr Res 122 : 38–42.

20. LeeYH, KimJH, SongGG (2013) Pathway analysis of a genome-wide association study in schizophrenia. Gene 525 : 107–115.

21. ChenLS, HutterCM, PotterJD, LiuY, PrenticeRL, et al. (2010) Insights into Colon Cancer Etiology via a Regularized Approach to Gene Set Analysis of GWAS Data. Am J Hum Genet 86 : 860–871.

22. WangK, LiM, BucanM (2007) Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet 81 : 1278–1283.

23. HolmansP, GreenEK, PahwaJS, FerreiraMA, PurcellSM, et al. (2009) Gene ontology analysis of GWA study datasets provides insights into the biology of bipolar disorder. Am J Hum Genet 85 : 13–24.

24. BuraeiZ, YangJ (2013) Structure and function of the β subunit of voltage-gated Ca(2+) channels. Biochim Biophys Acta 1828 : 1530–1540.

25. Cross-Disorder Group of the Psychiatric Genomics Consortium (2010) SmollerJW, CraddockN, KendlerK, LeePH, et al. (2010) Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 381 : 1371–1379. Erratum in: Lancet 381 : 1360.

26. LeeMT, ChenCH, LeeCS, ChenCC, ChongMY, et al. (2011) Genome-wide association study of bipolar I disorder in the Han Chinese population. Mol Psychiatry 16 : 548–556.

27. KobayashiT, YamadaY, FukaoM, TsutsuuraM, TohseN (2007) Regulation of Cav1.2 current: interaction with intracellular molecules. J Pharmacol Sci 103 : 347–353.

28. Psychiatric GWAS Consortium Bipolar Disorder Working Group (2013) Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat Genet 43 : 977–983.

29. FerreiraMA, O'DonovanMC, MengYA, JonesIR, RuderferMD, et al. (2008) Collaborative genome-wide association analysis supports a role for ANK3 and CACNA1C in bipolar disorder. Nat Genet 40 : 1056–1058.

30. Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium (2011) RipkeS, SandersAR, KendlerKS, LevinsonDF, et al. (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Gen 43 : 969–976.

31. GreenEK, GrozevaD, JonesI, JonesL, KirovG, et al. (2010) The bipolar disorder risk allele at CACNA1C also confers risk of recurrent major depression and of schizophrenia. Mol Psychiatry 15 : 1016–1022.

32. HoughtonFJ, BellinghamSA, HillAF, BourgesD, AngDK, et al. (2012) Arl5b is a Golgi-localised small G protein involved in the regulation of retrograde transport. Exp Cell Res 318 : 464–477.

33. TayaS, ShinodaT, TsuboiD, AsakiJ, NagaiK, et al. (2007) DISC1 regulates the transport of the NUDEL/LIS1/14-3-3epsilon complex through kinesin-1. J Neurosci 27 : 15–26.

34. GuoY, MonahanK, WuH, GertzJ, VarleyKE, et al. (2012) CTCF/cohesin-mediated DNA looping is required for protocadherin α promoter choice. Proc Natl Acad Sci U S A 109 : 21081–21086.

35. MonahanK, RudnickND, KehayovaPD, PauliF, NewberryKM, et al. (2012) Role of CCCTC binding factor (CTCF) and cohesin in the generation of single-cell diversity of protocadherin-α gene expression. Proc Natl Acad Sci U S A 109 : 9125–9130.

36. PhillipsJE, CorcesVG (2009) CTCF: master weaver of the genome. Cell 137 : 1194–1211.

37. HirayamaT, TarusawaE, YoshimuraY, GaljartN, YagiT (2012) CTCF is required for neural development and stochastic expression of clustered Pcdh genes in neurons. Cell Rep 2 : 345–357.

38. KirovG, GeorgievaL, WilliamsN, NikolovI, NortonN, et al. (2003) Variation in the protocadherin gamma A gene cluster. Genomics 82 : 433–440.

39. GregórioSP, SalletPC, DoKA, LinE, GattazWF, Dias-NetoE (2009) Polymorphisms in genes involved in neurodevelopment may be associated with altered brain morphology in schizophrenia: preliminary evidence. Psychiatry Res 165 : 1–9.

40. MatsuzakaY, OkamotoK, MabuchiT, IizukiM, OzawaA, et al. (2004) Identification, expression analysis amd polymorphism of a novel RLTPR gene encoding a RGD motif, tropomodulin domain and proline/leucine-rich regions. Gene 343 : 291–304.

41. D'SouzaSE, GinsbergMH, PlowEF (1991) Arginyl-glycyl-aspartic acid (RGD): a cell adhesion motif. Trens Biochem Sci 16 : 246–250.

42. JamadarS, PowersNR, MedaSA, GelernterJ, GruenJR (2011) Genetic influences of cortical gray matter in language-related regions in healthy controls and schizophrenia. Schizophr Res 129 : 141–148.

43. LiT, ZengZ, ZhaoQ, WangT, HuangK, et al. (2012) FoxP2 is significantly associated with schizophrenia and major depression in the Chinese Han Population. World J Biol Psychiatry 14 : 146–150.

44. WalkerRM, HillAE, NewmanAC, HamiltonG, TorranceHS, et al. (2012) The DISC1 promoter: characterization and regulation by FOXP2. Hum Mol Genet 21 : 2862–2872.

45. NakataniK, SakaueH, ThompsonDA, WeigelRJ, RothRA (1999) Identification of a human Akt3 (protein kinase B gamma) which contains the regulatory serine phosphorylation site. Biochem Biophys Res Commun 257 : 906–910.

46. DiezH, GarridoJJ, WandosellF (2012) Specific roles of Akt iso forms in apoptosis and axon growth regulation in neurons. PLoS ONE 7: e32715 doi:10.1371/journal.pone.0032715

47. PoduriA, EvronyGD, CaiX, ElhosaryPC, BeroukhimR, LehtinenMK (2012) Somatic activation of AKT3 causes hemispheric developmental brain malformations. Neuron 74 : 41–48.

48. RietschelM, MattheisenM, DegenhardtF (2012) Genetic Risk and Outcome in Psychosis (GROUP Investigators) (2012) MühleisenTW, et al. (2012) Association between genetic variation in a region on chromosome 11 and schizophrenia in large samples from Europe. Mol Psychiatry 17 : 906–917.

49. PriebeL, DegenhardtF, StrohmaierJ, BreuerR, HermsS, et al. (2013) Copy Number Variants in German Patients with Schizophrenia. PLoS ONE 8 (7) e64035.

50. ShiJ, LevinsonDF, DuanJ, SandersAR, ZhengY, et al. (2009) Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature 460 : 753–757.

51. PurcellS, NealeB, Todd-BrownK, ThomasL, FerreiraMAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81 : 559–575.

52. VeyrierasJB, KudaravalliS, KimSY, DermitzakisET, GiladY, et al. (2008) High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet 2008; 4: e1000214.

53. SubramanianA, TamayoP, MoothaVK, MukherjeeS, EbertBL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102 : 15545–15550.

54. The Gene Ontology Consortium Gene ontology: tool for the unification of biology. Nat Genet 25 : 25–29.

55. KanehisaM, GotoS (2010) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28 : 27–30.

56. KanehisaM, GotoS, SatoY, FurumichiM, TanabeM (2012) KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res 40: D109–D114.

57. BenjaminiY, HochbergY (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol 57 : 289–300.

58. GoemanJJ, MansmannU (2008) Multiple testing on the directed acyclic graph of gene ontology. Bioinformatics 24 : 537–544.

59. BørglumAD, DemontisD, GroveJ, PallesenJ, HollegaardMV, et al. (2013) Genome-wide study of association and interaction with maternal cytomegalovirus infection suggests new schizophrenia loci. Mol Psychiatry 2013 doi: 10.1038/mp.2013.2

60. 1000 Genomes Project Consortium (2010) AbecasisGR, AltshulerD, AutonA, BrooksLD, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467 : 1061–1073.

61. JohnsonA, HandsakerR, PulitS, NizzariM, O'DonnellC, et al. (2008) SNAP: A web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24 : 2938–2939.

62. AdzhubeiIA, SchmidtS, PeshkinL, RamenskyVE, GerasimovaA, et al. (2010) A method and server for predicting damaging missense mutations. Nat Meth 7 : 248–249.

63. KumarP, HenikoffS, NgPC (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4 : 1073–1081.