An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions

David J. Miller, Yanxin Zhang, Guoqiang Yu, Yongmei Liu, Li Chen, Carl D. Langefeld, David Herrington, Yue Wang

Research output: Contribution to journal (Article)

Abstract

Motivation: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models.

Results: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit, and is determined by, the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way interaction, involving fewer than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods.
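As a rough illustration of aspects (ii) and (iv) above, the following Python sketch shows how dominant/recessive SNP coding can be combined into a k-way interaction indicator, and how the Bayesian Information Criterion trades fit against model size. It is not the authors' implementation. The abstract does not give MECPM's feature definitions or its exact BIC form, so the genotype encoding (minor-allele counts 0, 1, 2), the function names `code_snp`, `interaction_feature` and `bic`, and the use of the textbook BIC formula are all assumptions made for this example.

```python
import math

# Assumption: each genotype is a minor-allele count (0, 1 or 2).
def code_snp(genotype: int, mode: str) -> int:
    """Binary SNP coding: 'dominant' fires on >= 1 minor allele, 'recessive' on 2."""
    if mode == "dominant":
        return int(genotype >= 1)
    if mode == "recessive":
        return int(genotype == 2)
    raise ValueError(f"unknown coding mode: {mode}")

def interaction_feature(genotypes, terms) -> int:
    """k-way interaction indicator: 1 only if every (snp_index, mode) term fires."""
    return int(all(code_snp(genotypes[i], mode) for i, mode in terms))

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Textbook BIC = k * ln(n) - 2 * ln(L); lower values are preferred."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

# One hypothetical individual genotyped at four SNPs.
sample = [2, 0, 1, 2]
# A two-way interaction: SNP 0 coded recessive, SNP 2 coded dominant.
two_way = [(0, "recessive"), (2, "dominant")]
print(interaction_feature(sample, two_way))  # 1

# Made-up log-likelihoods: the extra interaction improves fit only slightly,
# so the smaller model has the lower (better) BIC.
print(bic(-120.0, 3, 200) < bic(-118.5, 4, 200))  # True
```

Because each term in an interaction carries its own coding mode, the same SNP can enter one interaction as dominant and another as recessive. The BIC penalty grows with the number of parameters, which is one way a single criterion can compare models containing interactions of different orders.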

Original language: English (US)
Pages (from-to): 2478-2485
Number of pages: 8
Journal: Bioinformatics
Volume: 25
Issue number: 19
DOI: 10.1093/bioinformatics/btp435
State: Published - 2009
Externally published: Yes

Fingerprint

Probability Model
Entropy
Maximum Entropy
Genomics
Learning
Conditional probability
Single Nucleotide Polymorphism
Interaction
Model structures
Phenotype
Modeling
Predictive Model
Penetrance
Genome-Wide Association Study
Nucleotides
Polymorphism
Genetic Markers
Sample Size
Support vector machines
Genes

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. / Miller, David J.; Zhang, Yanxin; Yu, Guoqiang; Liu, Yongmei; Chen, Li; Langefeld, Carl D.; Herrington, David; Wang, Yue.

In: Bioinformatics, Vol. 25, No. 19, 2009, p. 2478-2485.

@article{dce41d8cf4834dbd9c00d542deb7231d,
title = "An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions",
author = "Miller, {David J.} and Yanxin Zhang and Guoqiang Yu and Yongmei Liu and Li Chen and Langefeld, {Carl D.} and David Herrington and Yue Wang",
year = "2009",
doi = "10.1093/bioinformatics/btp435",
language = "English (US)",
volume = "25",
pages = "2478--2485",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "19",

}
