Identifying SNPs predictive of phenotype using random forests

Alexandre Bureau, Josée Dupuis, Kathleen Falls, Kathryn L. Lunetta, Brooke Hayward, Tim P. Keith, Paul Van Eerdewegh

Research output: Contribution to journal (Article)

Abstract

There has been a great interest and a few successes in the identification of complex disease susceptibility genes in recent years. Association studies, where a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of the bootstrap samples are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification occurring when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index value, but predictive importance and association do not always coincide.
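The procedure described in the abstract (trees grown on bootstrap samples, a random subset of predictors tried at each node, out-of-bag error, and importance measured as the increase in misclassification after permutation) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the simulated genotypes, the dominant disease model at SNP 0, the tree depth, the forest size, and all function names are assumptions made for the example. Pair importance is computed here by permuting two SNPs jointly, which is one plausible reading; the abstract does not give the paper's exact definition.

```python
import random

def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(rows, labels, mtry, depth, rng):
    """Grow a classification tree; each node tries a random subset of mtry SNPs."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == 0 or len(set(labels)) < 2:
        return majority  # leaf: predicted class
    best = None
    for j in rng.sample(range(len(rows[0])), mtry):
        for cut in (0, 1):  # genotypes coded 0/1/2; split on genotype <= cut
            left = [y for x, y in zip(rows, labels) if x[j] <= cut]
            right = [y for x, y in zip(rows, labels) if x[j] > cut]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, j, cut)
    if best is None:
        return majority
    _, j, cut = best
    li = [i for i, x in enumerate(rows) if x[j] <= cut]
    ri = [i for i, x in enumerate(rows) if x[j] > cut]
    return (j, cut,
            grow_tree([rows[i] for i in li], [labels[i] for i in li], mtry, depth - 1, rng),
            grow_tree([rows[i] for i in ri], [labels[i] for i in ri], mtry, depth - 1, rng))

def predict(tree, x):
    """Follow splits until a leaf (an int class label) is reached."""
    while isinstance(tree, tuple):
        j, cut, left, right = tree
        tree = left if x[j] <= cut else right
    return tree

def grow_forest(rows, labels, n_trees, mtry, depth, rng):
    """Grow each tree on a bootstrap sample and remember its out-of-bag (OOB) rows."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(boot))
        tree = grow_tree([rows[i] for i in boot], [labels[i] for i in boot], mtry, depth, rng)
        forest.append((tree, oob))
    return forest

def oob_error(forest, rows, labels, permute=(), rng=None):
    """Pooled OOB misclassification rate; columns in `permute` are shuffled
    jointly among each tree's OOB rows before prediction."""
    errors = total = 0
    for tree, oob in forest:
        source = list(oob)
        if permute:
            rng.shuffle(source)
        for i, s in zip(oob, source):
            x = list(rows[i])
            for j in permute:
                x[j] = rows[s][j]  # take the permuted genotype from another OOB row
            errors += predict(tree, x) != labels[i]
            total += 1
    return errors / total

# Simulated data (assumption): 200 individuals, 5 SNPs, phenotype driven by SNP 0 only.
rng = random.Random(0)
n, n_snps = 200, 5
rows = [[rng.choice((0, 1, 2)) for _ in range(n_snps)] for _ in range(n)]
labels = [int(x[0] >= 1) for x in rows]  # hypothetical dominant model at SNP 0

forest = grow_forest(rows, labels, n_trees=30, mtry=2, depth=3, rng=rng)
base = oob_error(forest, rows, labels)
single = [oob_error(forest, rows, labels, (j,), rng) - base for j in range(n_snps)]
pair01 = oob_error(forest, rows, labels, (0, 1), rng) - base
print("single-SNP importance:", [round(v, 3) for v in single])
print("pair (0, 1) importance:", round(pair01, 3))
```

In this toy setting the causal SNP should receive the largest importance, while the unassociated SNPs should stay near zero. The sketch pools OOB predictions across trees for simplicity; other implementations average per-tree differences instead.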

Original language: English (US)
Pages (from-to): 171-182
Number of pages: 12
Journal: Genetic epidemiology
Volume: 28
Issue number: 2
DOI: https://doi.org/10.1002/gepi.20041
State: Published - Feb 1 2005
Externally published: Yes

Fingerprint

Single Nucleotide Polymorphism
Phenotype
Asthma
Genes
Chromosome Mapping
Disease Susceptibility
Forests
Sample Size
Case-Control Studies

Keywords

  • Case-control study
  • Classification trees
  • Genotype-phenotype association
  • Predictive importance

ASJC Scopus subject areas

  • Genetics (clinical)
  • Epidemiology

Cite this

Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., & Van Eerdewegh, P. (2005). Identifying SNPs predictive of phenotype using random forests. Genetic epidemiology, 28(2), 171-182. https://doi.org/10.1002/gepi.20041

@article{2acdc19fb38c46cca3fd7977d2c6b885,
title = "Identifying SNPs predictive of phenotype using random forests",
keywords = "Case-control study, Classification trees, Genotype-phenotype association, Predictive importance",
author = "Alexandre Bureau and Jos{\'e}e Dupuis and Kathleen Falls and Lunetta, {Kathryn L.} and Brooke Hayward and Keith, {Tim P.} and {Van Eerdewegh}, Paul",
year = "2005",
month = "2",
day = "1",
doi = "10.1002/gepi.20041",
language = "English (US)",
volume = "28",
pages = "171--182",
journal = "Genetic Epidemiology",
issn = "0741-0395",
publisher = "Wiley-Liss Inc.",
number = "2",
}

TY - JOUR
T1 - Identifying SNPs predictive of phenotype using random forests
AU - Bureau, Alexandre
AU - Dupuis, Josée
AU - Falls, Kathleen
AU - Lunetta, Kathryn L.
AU - Hayward, Brooke
AU - Keith, Tim P.
AU - Van Eerdewegh, Paul
PY - 2005/2/1
Y1 - 2005/2/1
KW - Case-control study
KW - Classification trees
KW - Genotype-phenotype association
KW - Predictive importance
UR - http://www.scopus.com/inward/record.url?scp=12744259874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12744259874&partnerID=8YFLogxK
U2 - 10.1002/gepi.20041
DO - 10.1002/gepi.20041
M3 - Article
C2 - 15593090
AN - SCOPUS:12744259874
VL - 28
SP - 171
EP - 182
JO - Genetic Epidemiology
JF - Genetic Epidemiology
SN - 0741-0395
IS - 2
ER -