TFE 2010-2011 (final year project)
The influence of linkage disequilibrium between markers on the outcome of an epistasis screening
Many common human diseases and traits are believed to be influenced by several genetic and environmental factors, each potentially modifying the effect of the others. Understanding the interplay between the genetic and non-genetic factors that underlie these complex diseases and traits is one of the major goals of genetic epidemiology. In genetic association studies of common complex diseases, single nucleotide polymorphisms (SNPs) are the most commonly used type of genetic marker (Marnellos, 2003). This is explained in part by their dense distribution across the genome and their low mutation rate. Genome-wide association (GWA) analysis, using a dense map of SNPs, has become one of the standard approaches for disentangling the genetic basis of complex genetic diseases (Hardy & Singleton, 2009). Although GWA studies have convincingly identified important genetic variants influencing a wide variety of common diseases and traits (Manolio et al., 2008; Seng & Seng, 2008), much of the heritability cannot be explained by the (major) genetic loci discovered so far (Manolio et al., 2009). This may be because, in reality, many associations are small, whereas the statistical techniques commonly used in this context only have sufficient power to detect moderate to large associations. Moreover, looking beyond single-locus effects and beyond additive inheritance models for SNPs should better reflect the biological pathways involved in disease etiology (Dixon et al., 2000).
Analyzing the combined effects of genes and/or environmental factors on the development of complex diseases is a great challenge from both a statistical and a computational perspective, even with a relatively small number of genetic and non-genetic exposures. Several data mining methods have been proposed for interaction analysis, among them the Multifactor Dimensionality Reduction method (MDR; Ritchie et al., 2001, 2003), which has proven its utility in a variety of theoretical and practical settings. Model-Based Multifactor Dimensionality Reduction (MB-MDR; Calle et al., 2008), a relatively new MDR-based technique that combines the strengths of non-parametric and parametric approaches, was developed to address some of the remaining concerns about an MDR analysis. These include the restriction to univariate, dichotomous traits, the absence of flexible ways to adjust for lower-order effects and important confounders, and the difficulty of highlighting epistasis effects when too many multi-locus genotype cells are pooled into two new genotype groups. Whereas the true value of MB-MDR can only be revealed by extensive application of the method in a variety of real-life scenarios, here we investigate the empirical power of MB-MDR to detect gene-gene interactions in the absence of any noise and in the presence of genotyping error, missing data, phenocopies, and genetic heterogeneity. Cattaert et al. (2010, in submission) have shown that the power of MB-MDR is generally higher than that of MDR, in particular in the presence of genetic heterogeneity, phenocopies, or low minor allele frequencies.
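To make the pooling step concrete, the following is a minimal sketch of how an MDR-style analysis collapses the multi-locus genotype cells of two SNPs into two groups. It is an illustration under stated assumptions, not the published MDR or MB-MDR implementation: the function name `mdr_pool` is hypothetical, genotypes are assumed to be coded 0/1/2, and a cell is labelled high risk when its case:control ratio exceeds the overall case:control ratio in the sample (one common choice of threshold).

```python
from collections import Counter


def mdr_pool(geno_a, geno_b, is_case):
    """Label each observed two-locus genotype cell 'H' (high risk) or 'L' (low risk).

    Assumption: a cell is high risk when its case:control ratio exceeds
    the overall case:control ratio of the sample.
    """
    cases = Counter()
    controls = Counter()
    for a, b, y in zip(geno_a, geno_b, is_case):
        # Count each individual in the cell defined by its two genotypes.
        (cases if y else controls)[(a, b)] += 1

    n_cases = sum(cases.values())
    n_controls = sum(controls.values())

    labels = {}
    for cell in set(cases) | set(controls):
        # Cross-multiplied form of cases/controls > n_cases/n_controls,
        # which avoids division by zero in cells without controls.
        high = cases[cell] * n_controls > controls[cell] * n_cases
        labels[cell] = "H" if high else "L"
    return labels
```

With up to nine cells reduced to two labels, cells with weak or no evidence end up in the same group as cells with strong evidence, which is the loss of resolution mentioned above.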
The aim of this thesis is to investigate the performance of MB-MDR in scenarios of indirect association, in which one or all of the actual causal loci are not observed directly. In such scenarios, the MB-MDR analysis is based on markers that are in high linkage disequilibrium (LD) with the causal loci. The analysis will be restricted to dichotomous traits. First, data need to be simulated from different epistasis models (Ritchie et al., 2001) covering the aforementioned scenarios. Second, the simulated data are used to estimate the power of MB-MDR in each scenario. Third, the results are compared with those obtained when the actual causal loci are observed. Fourth, the results are placed in a broader context: how do they alter the view on the use of tagging SNPs in epistasis analysis? A useful reference in this context is Chapman et al. (2007).
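The first step, simulating data in which only markers in LD with the causal loci are observed, could be sketched as follows. This is a rough illustration rather than the simulation design of the thesis. It assumes biallelic loci, Hardy-Weinberg equilibrium, positive LD parameterized by a target r², two unlinked causal-marker pairs, and cohort sampling rather than case-control sampling. The function names and the penetrance table `PENETRANCE` are hypothetical and are not taken from Ritchie et al. (2001).

```python
import random


def haplotype_freqs(p, q, r2):
    """Frequencies of the four (causal allele, marker allele) haplotypes.

    p, q: frequency of allele 1 at the causal and marker locus.
    r2:   target squared correlation between the two loci (positive LD).
    """
    d = (r2 * p * (1 - p) * q * (1 - q)) ** 0.5  # LD coefficient D
    freqs = {
        (1, 1): p * q + d,
        (1, 0): p * (1 - q) - d,
        (0, 1): (1 - p) * q - d,
        (0, 0): (1 - p) * (1 - q) + d,
    }
    if min(freqs.values()) < 0:
        raise ValueError("r2 is not attainable for these allele frequencies")
    return freqs


def draw_genotypes(freqs, rng):
    """Draw two haplotypes independently; return (causal, marker) genotypes 0/1/2."""
    haps = rng.choices(list(freqs), weights=list(freqs.values()), k=2)
    return haps[0][0] + haps[1][0], haps[0][1] + haps[1][1]


def simulate(n, penetrance, freqs_1, freqs_2, seed=0):
    """Simulate n individuals as tuples (c1, c2, m1, m2, affected).

    c1, c2 are causal genotypes (hidden in an indirect-association analysis);
    m1, m2 are the observed marker genotypes.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c1, m1 = draw_genotypes(freqs_1, rng)
        c2, m2 = draw_genotypes(freqs_2, rng)
        # Disease status depends on the causal genotypes only.
        affected = rng.random() < penetrance[c1][c2]
        rows.append((c1, c2, m1, m2, int(affected)))
    return rows


# Hypothetical penetrance table, indexed as PENETRANCE[c1][c2].
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
```

Running the same analysis once on the columns (c1, c2) and once on (m1, m2), for decreasing values of r², gives the direct and indirect power estimates that the third step compares.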
I expect this work to provide a proof of concept that currently available tools for epistasis analysis should instead be applied to whole-sequence or re-sequencing data. To make this feasible, however, extra measures are needed to deal adequately with dependent markers and to make epistasis screening tools computationally efficient. This thesis may therefore lead to a first scientific publication.
See van_steen_3_2011_the_influence_ld_epistasis_screening.doc for references and figures.