Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies

Andrei S. Rodin, Anatoliy Litvinenko, Kathy Klos, Alanna C. Morrison, Trevor Woodage, Josef Coresh, Eric Boerwinkle

Research output: Contribution to journal › Article

Abstract

Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a "wrapper" strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes fewer than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (fewer than 150 SNPs) and LDL (fewer than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
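
The general idea of a wrapper, in which candidate variable subsets are judged by the performance of a classifier trained on them, can be illustrated with a minimal sketch. The abstract does not spell out the SIZEFIT procedure, so the code below is not the authors' method: it is a generic backward elimination loop under stated assumptions. The names `backward_elimination` and `make_toy_scorer`, the `drop_fraction` and `tolerance` parameters, and the SNP labels are all hypothetical, and a deterministic toy scoring function stands in for a Random Forests classifier so that the example runs with the Python standard library alone.

```python
import random


def backward_elimination(features, score, drop_fraction=0.5,
                         tolerance=0.0, min_size=1):
    """Generic wrapper loop (illustrative, not the paper's SIZEFIT).

    `score(subset)` must return (accuracy, importance), where importance
    maps each feature in the subset to a number (higher = more useful).
    The lowest-ranked features are dropped repeatedly until accuracy
    falls more than `tolerance` below the best accuracy seen so far.
    """
    current = list(features)
    best_score, ranking = score(current)
    while len(current) > min_size:
        keep = max(min_size, int(len(current) * (1 - drop_fraction)))
        # Keep the top-ranked features according to the latest importances.
        candidate = sorted(current, key=lambda f: ranking[f],
                           reverse=True)[:keep]
        cand_score, cand_ranking = score(candidate)
        if cand_score < best_score - tolerance:
            break  # accuracy degraded: stop and keep the previous subset
        current, ranking = candidate, cand_ranking
        best_score = max(best_score, cand_score)
    return current, best_score


def make_toy_scorer(informative, seed=0):
    """Hypothetical stand-in for a classifier's accuracy and importances."""
    rng = random.Random(seed)

    def score(subset):
        hits = sum(1 for f in subset if f in informative)
        noise = len(subset) - hits
        # Accuracy rises with informative features, dips slightly with noise.
        accuracy = 0.5 + 0.5 * hits / len(informative) - 0.0001 * noise
        # Informative features score >= 1.0, noise features score < 0.5.
        importance = {
            f: (1.0 + rng.random()) if f in informative else 0.5 * rng.random()
            for f in subset
        }
        return accuracy, importance

    return score


# Toy data: 1,000 made-up SNP labels, three of which carry signal.
features = [f"snp{i}" for i in range(1000)]
informative = {"snp3", "snp42", "snp977"}
selected, accuracy = backward_elimination(
    features, make_toy_scorer(informative), min_size=3
)
print(sorted(selected), accuracy)
```

In a real analysis the `score` callable would train a classifier on the genotype columns in `subset` and return a held-out or out-of-bag accuracy together with per-variable importances; that part is deliberately left out here because the abstract gives no implementation details.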

Original language: English (US)
Pages (from-to): 1705-1718
Number of pages: 14
Journal: Journal of Computational Biology
Volume: 16
Issue number: 12
DOI: 10.1089/cmb.2008.0037
State: Published - Dec 1 2009

Fingerprint

Wrapper
Single nucleotide Polymorphism
Random Forest
Variable Selection
Nucleotides
Polymorphism
Single Nucleotide Polymorphism
Genomics
Classifiers
Classifier
Coronary Heart Disease
Lipoproteins
Coronary Disease
Ranking
LDL Lipoproteins
Subset
Optimization Algorithm
Genes
Extremal Optimization
Atherosclerosis

Keywords

  • Coronary heart disease
  • Genome-wide association studies
  • Random forests classifier
  • SNPs
  • Variable selection

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Modeling and Simulation
  • Computational Theory and Mathematics

Cite this

Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies. / Rodin, Andrei S.; Litvinenko, Anatoliy; Klos, Kathy; Morrison, Alanna C.; Woodage, Trevor; Coresh, Josef; Boerwinkle, Eric.

In: Journal of Computational Biology, Vol. 16, No. 12, 01.12.2009, p. 1705-1718.

Rodin, Andrei S. ; Litvinenko, Anatoliy ; Klos, Kathy ; Morrison, Alanna C. ; Woodage, Trevor ; Coresh, Josef ; Boerwinkle, Eric. / Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies. In: Journal of Computational Biology. 2009 ; Vol. 16, No. 12. pp. 1705-1718.
@article{b309b46058bf4b8a972e6a73c0f8e89a,
title = "Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies",
abstract = "Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a {"}wrapper{"} strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.",
keywords = "Coronary heart disease, Genome-wide association studies, Random forests classifier, SNPs, Variable selection",
author = "Rodin, {Andrei S.} and Anatoliy Litvinenko and Kathy Klos and Morrison, {Alanna C.} and Trevor Woodage and Josef Coresh and Eric Boerwinkle",
year = "2009",
month = "12",
day = "1",
doi = "10.1089/cmb.2008.0037",
language = "English (US)",
volume = "16",
pages = "1705--1718",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "12",

}

TY - JOUR

T1 - Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies

AU - Rodin, Andrei S.

AU - Litvinenko, Anatoliy

AU - Klos, Kathy

AU - Morrison, Alanna C.

AU - Woodage, Trevor

AU - Coresh, Josef

AU - Boerwinkle, Eric

PY - 2009/12/1

Y1 - 2009/12/1

N2 - Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a "wrapper" strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.

KW - Coronary heart disease

KW - Genome-wide association studies

KW - Random forests classifier

KW - SNPs

KW - Variable selection

UR - http://www.scopus.com/inward/record.url?scp=75149176440&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=75149176440&partnerID=8YFLogxK

U2 - 10.1089/cmb.2008.0037

DO - 10.1089/cmb.2008.0037

M3 - Article

VL - 16

SP - 1705

EP - 1718

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 12

ER -