SECA: SNP Effect Concordance Analysis
To run SECA:Approach:
SECA first aligns the single nucleotide polymorphism (SNP) effects across two sets of genome-wide association study (GWAS) summary results (dataset1 and dataset2). All SNPs are assumed to be on the positive (forward) strand. SECA's SNP effect alignment is based on the extensively used METAL (Willer et al 2010) source code. As SECA requires both the effect allele (EA) and non-effect allele (NEA) for all SNPs, SECA appropriately accounts for situations when different input files use different reference alleles. Alleles can be coded numerically (A=1,C=2,G=3,T=4) or alphabetically (A,C,G,T,a,c,g,t) and can be on either strand if not an A/T or C/G SNP. For A/T or C/G SNPs, SECA assumes SNPs to be on a consistent (i.e., "positive" or "forward") strand in the two input files for the results to be interpretable. For other SNPs, SECA can automatically identify and resolve strand inconsistencies.
SECA
extracts a subset of independent SNPs via linkage disequilibrium (LD)
clumping (Purcell et al. 2007). The default clumping procedure is
'p-value informed'. The approach iterates from the first to last SNP on each chromosome
sorted from smallest to largest p-value in dataset1 (P_{1})
that has not already been clumped (denoting this as the index SNP) and forms clumps of all other SNPs
that are within 1 Mb and in LD with the index SNP (r^{2} > 0.1)
based on HapMap or 1000 Genomes Project (1kGP) genotype data. A subsequent
round of LD clumping is performed to ensure none of the index SNPs (from the
first round) within 10 Mb
of each other are in long-range LD (r^{2} > 0.1). This default
approach identifies the subset of independent SNPs with the
most significant association p-values in dataset1.
An alternate LD clumping approach identifies an effectively random subset of
independent SNPs by setting all dataset1 p-values
(P_{1}) equal to 0.5 (and since 23 Jan 2017, sorting the SNPs by their name [in case your uploaded datasets were already p-value sorted]) before the clumping process.
Please note: the 1kGP reference options (with b37 bp positions) for LD estimation are provided here for
convenience to cater for users with common SNP rsIDs (MAF > 0.01) which may not
be in a specific HapMap 2 or HapMap 3 reference dataset. SECA will happily analyse
several million SNPs (i.e., HapMap 2 and 3 based imputation) but it is not
designed (or necessary) to analyse 1kGP imputed GWAS summary results (e.g., >6 million SNPs
after QC).
Indeed, our web server will time out if your files have not finished
uploading within 60 minutes!
Furthermore, because independent SNPs are required, analysis of observed genotype data from SNPs on GWA
arrays is sufficient. Users with GWAS summary results from 1kGP imputation
and/or wanting more control over SNP selection should perform their own LD clumping and
instead use iSECA.
Restricting to SNPs associated with P_{1} ≤ {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} in dataset1, SECA first performs exact binomial statistical tests using the R statistical package (R Development Core Team 2008) to determine whether there is an excess of SNPs associated in both datasets (dataset1 and dataset2) for the subset of SNPs associated with P_{2} ≤ {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} in dataset2. For each of the 144 SNP subsets (generated using these 12 × 12 p-value threshold {P_{1}} ⋂ {P_{2}} combinations), binomial tests of associated SNPs in dataset1 (P_{1}) and dataset2 (P_{2}) are performed. A binomial test 'heatmap' plot is generated to graphically summarise the proportion of SNP subsets with an excess [observed (obs) ≥ expected (exp)] or deficit (obs < exp) number of associated SNPs, and an empirical (permuted) p-value is calculated for the observed number of subsets with nominally significant excess. That is, a permuted p-value (P_{BT-permuted}) for the observed number of binomial tests with obs ≥ exp and P_{BT} ≤ 0.05. We note that for polygenic traits, a well-powered GWAS should produce an excess of smaller p-values. Therefore, instead of specifying P_{2} as the expected proportion of SNPs with P_{1} and P_{2}, the proportion of SNPs associated in dataset2 (irrespective of its association in dataset1) is utilised as the 'overlap' (null) probability (i.e., 'expected proportion') in the binomial tests. For example, the observed proportion of total SNPs with P_{1} ≤ 1 and P_{2} ≤ 0.01 is used as the expected proportion of SNPs with P_{1} ≤ {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} ⋂ P_{2} ≤ 0.01.
SECA next performs Fisher's exact statistical tests to determine whether there is an excess of SNPs where the effect directions―regression coefficient (BETA) or odds ratio (OR) [OR = exp(BETA)]―are concordant across dataset1 and dataset2 for the subset of SNPs associated with P_{1} ≤ {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} in dataset1 and P_{2} ≤ {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} in dataset2. For each of the 144 SNP subsets (generated using these 12 × 12 p-value threshold {P_{1}} ⋂ {P_{2}} combinations), Fisher's exact statistical tests of SNP effects in dataset1 (BETA_{1} or OR_{1}) and dataset2 (BETA_{2} or OR_{2}) are performed on 2 × 2 tables containing the number of SNPs with both OR_{1} < 1 and OR_{2} < 1 [i.e., OR_{1} < 1 ⋂ OR_{2} < 1 (−−)], OR_{1} ≥ 1 ⋂ OR_{2} < 1 (+−), OR_{1} < 1 ⋂ OR_{2} ≥ 1 (−+), and OR_{1} ≥ 1 ⋂ OR_{2} ≥ 1 (++). A Fisher's test 'heatmap' plot is generated to graphically summarise the proportion of SNP subsets with concordant (Fisher's test odds ratio, OR_{FT} ≥ 1) and discordant (OR_{FT} < 1) SNP effects and an empirical (permuted) p-value is calculated for the observed number of subsets with nominally significant concordance. That is, a permuted p-value (P_{FT-permuted}) for the observed number of Fisher's tests with OR_{FT} ≥ 1 and P_{FT} ≤ 0.05.
An R script ('SECA_Ranalysis_HeatmapPerm.R') allowing users to increase the number of replicates (default = 1000) for the permutation of heatmap p-values using the "independent_aligned_effects.txt" file generated by SECA can be downloaded here.
Restricting to SNPs nominally associated with P_{1} ≤ 0.05 in dataset1, SECA next reports detailed results from binomial and Fisher's statistical tests to determine whether there is an excess of overlapping ('pleiotropic') SNPs and whether the effect directions are concordant (or discordant) across dataset1 and dataset2 for the subset of SNPs nominally associated with P_{2} ≤ 0.05 in dataset2.
SECA next reports detailed binomial and Fisher's test results (where applicable) for two practical subsets: i) SNPs genome-wide significant (P_{1} ≤ 5 × 10^{-8}) in dataset1, and nominally associated (P_{2} ≤ 0.05) in dataset2; and ii) SNPs genome-wide suggestive (P_{1} ≤ 1 × 10^{-5}) in dataset1, and nominally associated (P_{2} ≤ 0.05) in dataset2
A 'grid' search is also performed across 150 p-value thresholds to identify the subset of SNPs overlapping dataset1 and dataset2 that produce the minimum binomial test ('pleiotropic') p-value and minimum Fisher's test (effect concordance) p-value. The search is performed at 5 equidistant p-value {P_{1}, P_{2}} increments for the interval [5 × 10^{-8}, 9 × 10^{-8}], i.e., {5 × 10^{-8}, 6 × 10^{-8}, 7 × 10^{-8}, 8 × 10^{-8}, 9 × 10^{-8}}; 9 equidistant increments for each of the intervals [1 × 10^{-7}, 9 × 10^{-7}], [1 × 10^{-6}, 9 × 10^{-6}], [1 × 10^{-5}, 9 × 10^{-5}], [1 × 10^{-4}, 9 × 10^{-4}], [1 × 10^{-3}, 9 × 10^{-3}]; and 100 increments of 0.01 between 0.01 and 1.0 [0.01, 1.0]. The results from all 150 × 150 = 22,500 SNP subsets are output to enable post-hoc examination of SNP overlap and concordance across the full range of statistical significance in dataset1 and dataset2.
Additionally, SECA generates quantile-quantile (Q-Q) and true discovery rate (TDR) plots for dataset2 p-values (P_{2}) stratified (conditioned) on dataset1 p-values (P_{1} ≤ {0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0}) to visualise whether there is an excess of overlapping ('pleiotropic') SNPs. A leftward shift in the Q-Q or TDR curve corresponds to an excess of SNPs with smaller p-values. In the presence of pleiotropy, we expect the Q-Q and TDR curves for dataset2 SNP p-values (P_{2}) to deviate further left of the identity line as we condition on them having smaller p-values in dataset1 (P_{1}).
Finally, 'pleiotropy-informed' conditional false discover rate (FDR) results are output for dataset2 p-values (P_{2}) stratified on dataset1 p-values (P_{1} ≤ {1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001}). These results may identify SNPs associated in dataset2 after conditioning on their association significance (P_{1}) in dataset1 and/or prioritise SNPs for follow-up studies.
Input format:
The GWAS summary results should be uploaded as a text file with five
essential columns (tab or space-delimited).
Additional columns may be present, but they will not be used in the
analyses.
Each dataset file must contain the following five columns of data with
a header row listing (case insensitive) column names of 'SNP', 'EA',
'NEA', 'P', and 'BETA' or 'OR':
Column name | Description of data in column |
'SNP' | SNP name (i.e., rsID). Column names of 'MARKER', 'SNPID' or 'rs_number' are also accepted. |
'EA' | Effect allele. Column names of 'A1', 'REF' or 'reference_allele' are also accepted. |
'NEA' | Non-effect allele. Column names of 'A2', 'ALT' or 'other_allele' are also accepted. |
'P' | P-value from the association test. Column names of 'PVALUE', 'PVAL' or 'p-value' are also accepted. |
'BETA' or 'OR' | Regression coefficient (BETA) [BETA=log_{e}(OR)] or odds ratio (OR) [OR=exp(BETA)], for the effect allele (EA) relative to the non-effect allele (NEA). For example, a positive BETA (OR > 1) means EA increases risk relative to NEA. SECA will automatically convert OR values to BETA values. Column name of 'ZSCORE' is also accepted. |
References:
HapMap 3 (release 2) QC+ SNP genotype data formatted as PLINK files (hapmap3_r2_b36_fwd.qc.poly.tar.bz2) was downloaded from the Broad Institute's HapMap 3 webpage: http://www.broadinstitute.org/~debakker/hapmap3r2.html. HapMap 2 (release 23a) filtered (MAF > 0.01 and genotyping rate > 0.95) SNP genotype data (founders) formatted as PLINK files was downloaded from the PLINK resources webpage: http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml. 1000 Genomes Project (1kGP) data (1000G PhaseI 2012 v3 Updated Integrated Phase 1 Release) was download from the MaCH download page and later converted to PLINK format. Note: only founder (unrelated) individuals are utilised in the LD estimation.
Nyholt DR (2014) SECA: SNP effect concordance analysis using genome-wide association summary results. Bioinformatics. 2014 Apr 1. [Epub ahead of print].
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007 81(3):559-75.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
SECA's SNP effect alignment is based on METAL (Willer et al 2010) source code (generic-metal-2011-03-25.tar.gz).
Willer CJ, Li Y,
Abecasis GR. METAL: fast and efficient meta-analysis
of genomewide association scans. Bioinformatics. 2010 26(17):2190-1.
Resources:
An R script ('SECA_Ranalysis_HeatmapPerm.R') allowing users to increase
the number of replicates (default = 1000) for the permutation of heatmap
p-values using the "independent_aligned_effects.txt" file generated by SECA can be
downloaded here.
SECA is coded by Dale R. Nyholt.
Copyright (C) 2013, Dale R. Nyholt. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The names of its contributors may not be used to endorse or promote products derived from this software without specific prior written permission.
4. We also request that use of this software be cited in publications as: Nyholt DR (2014) SECA: SNP effect concordance analysis using genome-wide association summary results. Bioinformatics. 2014 Apr 1. [Epub ahead of print].
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Page last updated April 16, 2014.The Neurogenetics Laboratory | The Queensland Institute of Medical Research | Contact Us |