The initial success of the Human Genome Project has been the (nearly) complete characterization of the consensus human sequence. This has greatly increased our ability to identify and describe the genomic structure of genes. A second major achievement is the characterization of genetic variability, both in identifying large numbers of SNPs and, more recently, insertions and deletions known as copy number variations (CNVs). Multiple technologies now allow a GWA design to be implemented by genotyping 500,000-1,000,000 SNPs with high fidelity and low cost. The alleles, genotypes, or haplotypes of these SNPs are tested directly for association with disease. Estimates suggest that with 500,000 SNPs, 85-92% of the common Caucasian variation in the genome will be captured. The same genotyping platforms also capture CNV information. Thus, GWA is by far the most detailed and complete method of whole genome interrogation currently available. GWAS has already been validated with the recent identification of the highly significant effect of the Y402H polymorphism in the complement factor H (CFH) gene in age-related macular degeneration. This was found simultaneously through GWA and targeted positional candidate approaches. It is important to point out, however, that the small sample sizes used for that GWA study were only sufficient because the effect of the Y402H polymorphism was unexpectedly large. Subsequently, the use of GWA methods has rapidly led to susceptibility gene identification in type 2 diabetes, type 1 diabetes, breast cancer, multiple sclerosis, Crohn’s disease, colorectal cancer, prostate cancer, and others.
There is no established paradigm for the analysis of GWAS data. An increasing number of publications have attempted to address very specific analytical issues. For example, exhaustive allelic transmission disequilibrium tests (EATDT) can be used to examine all possible single locus and haplotype combinations in a computationally efficient manner, but are restricted to parent-child trio data. Classical statistical methods such as the chi-square test, logistic regression, and the Armitage trend test are commonly used for case-control association studies. A problem with such an initial analytical approach is the large number of expected false positive results. Using a nominal α=0.05 on 500,000 SNPs yields, on average, 25,000 false positive results when no SNP is truly associated. Much has been written about how to correct for the vast number of single locus tests being performed, but consensus has not yet emerged. A Bonferroni correction is clearly too conservative for several reasons, including the fact that it assumes the independence of each test even though many of the SNPs are in linkage disequilibrium (LD) and thus correlated with each other. Substantial effort has been devoted to developing alternatives to Bonferroni correction for multiple testing. Many of these methods are promising and much research is ongoing.
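The multiple-testing arithmetic above can be made concrete with a minimal sketch. The function names are hypothetical and chosen only for illustration. The calculation assumes every SNP is null, and the Bonferroni threshold treats all tests as independent, which is precisely the assumption that LD violates.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives if every tested SNP is truly null."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test significance threshold that holds the family-wise error
    rate at alpha, treating all tests as independent."""
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 500_000   # SNP count used in the text
    alpha = 0.05       # nominal per-test significance level

    # Roughly 25,000 false positives are expected at the nominal level.
    print(f"Expected false positives: {expected_false_positives(n_snps, alpha):,.0f}")

    # The Bonferroni per-SNP threshold is on the order of 1e-7.
    print(f"Bonferroni threshold: {bonferroni_threshold(n_snps, alpha):.1e}")
```

Because SNPs in LD are correlated, the effective number of independent tests is smaller than 500,000, so a threshold computed this way is stricter than necessary.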
While we view corrections for multiple testing as an important aspect of the statistical analysis, for our paradigm the exact nature of the correction is not critical, since these analyses are only one part of our overall study design. Skol et al. (Nat. Genet. 2006;38:209-213) examined the use of joint analysis as a more efficient approach to two-stage GWA than replication-based analysis. They showed that joint analysis of both stages of the data resulted in increased power to detect genetic association, even when effect sizes differ between the two stages. Even with this added power, supplementary data ultimately need to be used to filter results down to a manageable number of the most likely genes to undergo comprehensive molecular analysis.