Application of two machine learning algorithms to genetic association studies in the presence of covariates

Bareng A.S. Nonyane; Andrea S. Foulkes

doi:10.1186/1471-2156-9-71

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized.

Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided.

Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

Original language	English (US)
Article number	71
Journal	BMC Genetics
Volume	9
DOIs	https://doi.org/10.1186/1471-2156-9-71
State	Published - Nov 14 2008
Externally published	Yes

ASJC Scopus subject areas

Genetics
Genetics (clinical)


Cite this

@article{e13948e9207b4143921236b6da138f48,

title = "Application of two machine learning algorithms to genetic association studies in the presence of covariates",

abstract = "Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.",

author = "Nonyane, {Bareng A.S.} and Foulkes, {Andrea S.}",

note = "Funding Information: Support for this research was provided by a National Institute of Allergy and Infectious Diseases (NIAID) Independent Research Award to ASF (R01 AI056983). This research was also supported in part by an NIH/NIDDK Research Award (R01 DK021224) and the Adult AIDS Clinical Trials Group (ACTG) funded by the NIAID (AI38858).",

year = "2008",

month = nov,

day = "14",

doi = "10.1186/1471-2156-9-71",

language = "English (US)",

volume = "9",

journal = "BMC Genetics",

issn = "1471-2156",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Application of two machine learning algorithms to genetic association studies in the presence of covariates

AU - Nonyane, Bareng A.S.

AU - Foulkes, Andrea S.

N1 - Funding Information: Support for this research was provided by a National Institute of Allergy and Infectious Diseases (NIAID) Independent Research Award to ASF (R01 AI056983). This research was also supported in part by an NIH/NIDDK Research Award (R01 DK021224) and the Adult AIDS Clinical Trials Group (ACTG) funded by the NIAID (AI38858).

PY - 2008/11/14

Y1 - 2008/11/14

N2 - Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

AB - Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

UR - http://www.scopus.com/inward/record.url?scp=58149343568&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=58149343568&partnerID=8YFLogxK

U2 - 10.1186/1471-2156-9-71

DO - 10.1186/1471-2156-9-71

M3 - Article

C2 - 19014573

AN - SCOPUS:58149343568

SN - 1471-2156

VL - 9

JO - BMC Genetics

JF - BMC Genetics

M1 - 71

ER -
