An empirical analysis of training protocols for probabilistic gene finders

William H. Majoros; Steven L. Salzberg

doi:10.1186/1471-2105-5-206

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

Background: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters.

Results: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method.

Conclusions: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.
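The protocol summarized in the abstract can be illustrated with a toy sketch. It is not the authors' gene finder. The synthetic data, the single-base "submodels", the global `bias` parameter, and all function names are assumptions made up for illustration. A simple step-halving hill climb on training accuracy stands in for the generalized gradient ascent the abstract mentions. The sketch shows three steps: submodel parameters are estimated locally from counts, one global parameter is then tuned discriminatively on the same training data, and accuracy is finally reported on a test set that was never used for either step.

```python
import math
import random
from collections import Counter


def make_data(n, rng):
    """Synthetic labelled windows (assumption): coding windows are GC-richer."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.3
        p_gc = 0.60 if label else 0.45
        seq = "".join(
            rng.choice("GC") if rng.random() < p_gc else rng.choice("AT")
            for _ in range(30)
        )
        data.append((seq, label))
    return data


def fit_submodel(seqs):
    """Relative base frequencies with a pseudocount of 1 per base."""
    counts = Counter("".join(seqs))
    total = sum(counts.values()) + 4
    return {b: (counts[b] + 1) / total for b in "ACGT"}


def log_odds(seq, coding, noncoding):
    """Log-likelihood ratio of the window under the two submodels."""
    return sum(math.log(coding[b] / noncoding[b]) for b in seq)


def accuracy(data, coding, noncoding, bias):
    """Fraction of windows whose predicted label matches the true label."""
    hits = sum((log_odds(s, coding, noncoding) + bias > 0) == y for s, y in data)
    return hits / len(data)


def hill_climb(data, coding, noncoding, bias=0.0, step=1.0, min_step=0.01):
    """Tune the global bias to maximize accuracy; halve the step when stuck."""
    best = accuracy(data, coding, noncoding, bias)
    while step >= min_step:
        improved = False
        for cand in (bias + step, bias - step):
            acc = accuracy(data, coding, noncoding, cand)
            if acc > best:
                bias, best, improved = cand, acc, True
                break
        if not improved:
            step /= 2
    return bias, best


rng = random.Random(0)
train = make_data(400, rng)
test = make_data(200, rng)  # held out: never used for fitting or tuning

# Step 1: local estimation of each submodel from the training data.
coding = fit_submodel([s for s, y in train if y])
noncoding = fit_submodel([s for s, y in train if not y])
baseline_train = accuracy(train, coding, noncoding, 0.0)

# Step 2: discriminative tuning of the global parameter on the same data.
bias, tuned_train = hill_climb(train, coding, noncoding)

# Step 3: final evaluation on the segregated test set only.
test_acc = accuracy(test, coding, noncoding, bias)
print(f"train accuracy before tuning: {baseline_train:.3f}")
print(f"train accuracy after tuning:  {tuned_train:.3f}")
print(f"held-out test accuracy:       {test_acc:.3f}")
```

Because the hill climb only accepts a candidate that strictly raises training accuracy, the tuned training accuracy can never fall below the untuned baseline. That is also why training accuracy can overstate performance, and why the final figure is taken from the held-out set.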

Original language	English (US)
Article number	206
Journal	BMC Bioinformatics
Volume	5
DOIs	https://doi.org/10.1186/1471-2105-5-206
State	Published - Dec 21 2004
Externally published	Yes

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Cite this

@article{39fac0493d284625a8b7b1d66ae11213,
  title = "An empirical analysis of training protocols for probabilistic gene finders",
  abstract = "Background: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. Results: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. Conclusions: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.",
  author = "Majoros, {William H.} and Salzberg, {Steven L.}",
  year = "2004",
  month = dec,
  day = "21",
  doi = "10.1186/1471-2105-5-206",
  language = "English (US)",
  volume = "5",
  journal = "BMC Bioinformatics",
  issn = "1471-2105",
  publisher = "BioMed Central",
}

TY - JOUR
T1 - An empirical analysis of training protocols for probabilistic gene finders
AU - Majoros, William H.
AU - Salzberg, Steven L.
PY - 2004/12/21
Y1 - 2004/12/21
N2 - Background: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. Results: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. Conclusions: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.

UR - http://www.scopus.com/inward/record.url?scp=13244268374&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=13244268374&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-5-206
DO - 10.1186/1471-2105-5-206
M3 - Article
C2 - 15613242
AN - SCOPUS:13244268374
SN - 1471-2105
VL - 5
JO - BMC Bioinformatics
JF - BMC Bioinformatics
M1 - 206
ER -
