Application of two machine learning algorithms to genetic association studies in the presence of covariates

Bareng A.S. Nonyane, Andrea S. Foulkes

Research output: Contribution to journal (Article)

Abstract

Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized.

Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided.

Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

Original language: English (US)
Article number: 71
Journal: BMC Genetics
Volume: 9
DOI: 10.1186/1471-2156-9-71
State: Published - Nov 14 2008
Externally published: Yes

Fingerprint

Genetic Association Studies
Genetic Polymorphisms
HIV-1
Genotype
Population
Genes
Machine Learning
Therapeutics

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

Application of two machine learning algorithms to genetic association studies in the presence of covariates. / Nonyane, Bareng A.S.; Foulkes, Andrea S.

In: BMC Genetics, Vol. 9, 71, 14.11.2008.


@article{e13948e9207b4143921236b6da138f48,
title = "Application of two machine learning algorithms to genetic association studies in the presence of covariates",
abstract = "Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.",
author = "Nonyane, {Bareng A.S.} and Foulkes, {Andrea S.}",
year = "2008",
month = "11",
day = "14",
doi = "10.1186/1471-2156-9-71",
language = "English (US)",
volume = "9",
journal = "BMC Genetics",
issn = "1471-2156",
publisher = "BioMed Central",
}

TY  - JOUR
T1  - Application of two machine learning algorithms to genetic association studies in the presence of covariates
AU  - Nonyane, Bareng A.S.
AU  - Foulkes, Andrea S.
PY  - 2008/11/14
Y1  - 2008/11/14
N2  - Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
AB  - Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
UR  - http://www.scopus.com/inward/record.url?scp=58149343568&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=58149343568&partnerID=8YFLogxK
U2  - 10.1186/1471-2156-9-71
DO  - 10.1186/1471-2156-9-71
M3  - Article
C2  - 19014573
AN  - SCOPUS:58149343568
VL  - 9
JO  - BMC Genetics
JF  - BMC Genetics
SN  - 1471-2156
M1  - 71
ER  -