TFE 2011-2012 (final year project)

Feature selection as pre-screening tool for multifactor dimensionality reduction methods

Understanding the effects of genes and environmental factors on the development of complex diseases, such as cancer, is a major aim of genetic epidemiology. Such diseases are controlled by complex molecular mechanisms characterized by the joint action of several genes, each having only a small effect. In this context, traditional single-marker methods are of limited use, and more advanced and efficient methods are needed to identify gene interactions or epistatic patterns.

The Multifactor Dimensionality Reduction (MDR) method (Ritchie et al. 2001) has achieved great popularity. The MDR strategy tackles the dimensionality problem of interaction detection by pooling multi-locus genotypes into two risk groups, high and low, thereby reducing multiple dimensions to one. The method has since been refined and has led to a similar strategy developed in-house, Model-Based Multifactor Dimensionality Reduction (MB-MDR; Calle et al. 2008), with improved characteristics. The basic steps of MB-MDR are displayed in Figure 1.

Figure 1: Graphical overview of major MB-MDR steps.
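
To illustrate the pooling idea behind MDR, the following is a minimal sketch for a single pair of markers. It is not the MB-MDR procedure of Figure 1. The function names (mdr_pool, pooled_variable), the genotype coding 0/1/2 and the choice of the overall case/control ratio as the default threshold are assumptions made for this example only.

```python
from collections import Counter

def mdr_pool(geno_a, geno_b, status, threshold=None):
    """Label each two-locus genotype cell 'H' (high risk) or 'L' (low risk).

    geno_a, geno_b: genotypes coded 0/1/2 (assumed coding), one per individual.
    status: 1 for cases, 0 for controls.
    threshold: case/control ratio above which a cell is called high risk;
               defaults to the overall case/control ratio (an assumption).
    """
    cases, controls = Counter(), Counter()
    for a, b, y in zip(geno_a, geno_b, status):
        (cases if y == 1 else controls)[(a, b)] += 1

    if threshold is None:
        n_controls = sum(controls.values())
        if n_controls == 0:
            raise ValueError("at least one control is needed")
        threshold = sum(cases.values()) / n_controls

    labels = {}
    for cell in set(cases) | set(controls):
        # Compare cases with threshold * controls to avoid dividing by zero
        # in cells that contain no controls.
        labels[cell] = "H" if cases[cell] > threshold * controls[cell] else "L"
    return labels

def pooled_variable(geno_a, geno_b, labels):
    """Replace each two-locus genotype by its one-dimensional risk label."""
    return [labels[(a, b)] for a, b in zip(geno_a, geno_b)]

# Toy data: 8 individuals, 4 cases and 4 controls.
geno_a = [0, 0, 1, 1, 2, 2, 0, 1]
geno_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [1, 0, 1, 1, 0, 0, 0, 1]

labels = mdr_pool(geno_a, geno_b, status)
pooled = pooled_variable(geno_a, geno_b, labels)
```

The two genotype dimensions are thus collapsed into one binary risk variable per individual, which can then be tested for association with the disease.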

Without parallel computing, performing MB-MDR at a genome-wide scale (involving perhaps 1 million markers) is computationally prohibitive. Even with parallel computing (Van Lishout et al. 2011), and given the emergence of genome-wide next-generation sequencing results, a pre-selection of markers is mandatory to keep computation time and memory storage within limits. It makes a huge difference whether all possible pairs or trios of markers are investigated in a group of N=250 or N=1,000,000 markers! Nevertheless, success in detecting higher-order genetic interactions using only N markers may depend heavily on the choice of the subset.
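
The difference between the two panel sizes can be made concrete by counting the marker combinations an exhaustive search would have to visit. The short sketch below uses only the standard binomial coefficient; the helper name n_combinations is chosen for this example.

```python
from math import comb

def n_combinations(n_markers, order):
    """Number of distinct marker sets of a given order (2 = pairs, 3 = trios)."""
    return comb(n_markers, order)

for n in (250, 1_000_000):
    pairs = n_combinations(n, 2)
    trios = n_combinations(n, 3)
    print(f"N={n:>9,}: pairs={pairs:,}  trios={trios:,}")
```

For N=250 this gives 31,125 pairs and 2,573,000 trios; for N=1,000,000 it gives roughly 5.0 x 10^11 pairs and 1.7 x 10^17 trios.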

The aim of this project is to investigate several existing strategies for selecting favorable combinations of markers, so as to increase the power to identify important genetic interactions. A key part of this investigation is to examine the consequences of multi-stage analyses. In practice, the study will first involve a literature search on the feature selection methods that are currently available (e.g., wrapper models, filter models such as the established TuRF method of Moore and White 2007, or Bayesian methods such as those developed by Sebastiani et al. 2008). A good starting point is the paper of Saeys et al. (2007), but additional strategies have been developed since then, some perhaps more suitable for epistasis screening. The literature study should give you a good understanding of the pros and cons of the methods in the context of epistasis screening and the search for networks of interacting markers. Second, the performance of the most promising techniques needs to be assessed via simulations and compared with the computational properties of MB-MDR without pre-screening. Third, a real-life genome-wide data application is envisaged.
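
As a point of reference for what a filter model does, here is a minimal sketch of the simplest possible filter: rank markers by a single-marker Pearson chi-square statistic and keep the top k. It is not TuRF or any of the methods cited above; the function names (chi_square_stat, prescreen), the marker names and the toy data are hypothetical. Note that a single-marker filter of this kind can discard markers that act only through interactions, which is exactly the kind of consequence the project is meant to study.

```python
def chi_square_stat(genotypes, status):
    """Pearson chi-square statistic of the genotype-by-status table."""
    n = len(genotypes)
    observed, row_totals, col_totals = {}, {}, {}
    for g, y in zip(genotypes, status):
        observed[(g, y)] = observed.get((g, y), 0) + 1
        row_totals[g] = row_totals.get(g, 0) + 1
        col_totals[y] = col_totals.get(y, 0) + 1

    stat = 0.0
    for g, r in row_totals.items():
        for y, c in col_totals.items():
            expected = r * c / n  # expected count under independence
            stat += (observed.get((g, y), 0) - expected) ** 2 / expected
    return stat

def prescreen(markers, status, k):
    """Return the names of the k markers with the largest statistic."""
    scores = {name: chi_square_stat(g, status) for name, g in markers.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: 4 cases followed by 4 controls, three hypothetical markers.
status = [1, 1, 1, 1, 0, 0, 0, 0]
markers = {
    "snp1": [2, 2, 1, 1, 0, 0, 0, 0],  # genotype strongly tied to status
    "snp2": [0, 1, 2, 0, 0, 1, 2, 0],  # same distribution in both groups
    "snp3": [1, 1, 1, 0, 0, 0, 0, 1],  # weak association
}

selected = prescreen(markers, status, k=2)
```

Only the selected markers would then be passed on to the interaction analysis, reducing the number of combinations to examine.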

Depending on the approaches taken in this project, the thesis may lead to a publication in, for instance, BMC Genetics.


  • Calle, M.L., Urrea, V., Vellalta, G., Malats, N. & Van Steen, K. (2008) Model-Based Multifactor Dimensionality Reduction for detecting interactions in high-dimensional genomic data. U.O.V. Department of Systems Biology (ed.).
  • Dixon, M.S., Golstein, C., Thomas, C.M., Van Der Biezen, E.A. & Jones, J.D. (2000) Genetic complexity of pathogen perception by plants: the example of Rcr3, a tomato gene required specifically by Cf-2. Proc Natl Acad Sci U S A, 97, 8807-14.
  • Hardy, J. & Singleton, A. (2009) Genomewide association studies and human disease. N Engl J Med, 360, 1759-68.
  • Manolio, T.A., Brooks, L.D. & Collins, F.S. (2008) A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 118, 1590-605.
  • Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., Cho, J.H., Guttmacher, A.E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C.N., Slatkin, M., Valle, D., Whittemore, A.S., Boehnke, M., Clark, A.G., Eichler, E.E., Gibson, G., Haines, J.L., Mackay, T.F., McCarroll, S.A. & Visscher, P.M. (2009) Finding the missing heritability of complex diseases. Nature, 461, 747-53.