Using protein domains to improve the accuracy of Ab Initio gene finding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using similarity to known protein domains to help identify genes in a DNA sequence. We have adapted the generalized hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM so that a higher probability is assigned to exons that contain homologues of protein domains. To our knowledge, this domain-homology-based approach has not previously been used in the context of ab initio gene prediction.

Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures similar to Pfam models. The augmented system, GlimmerHMM+, shows a 2% improvement in sensitivity and a 1% increase in specificity in predicting exact gene structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms, Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish), and together these preliminary results demonstrate the value of using protein domain homology in gene prediction. The results are encouraging, and we believe that a more comprehensive approach, including a model that reflects the statistical characteristics of specific sets of protein domain families, would yield a greater increase in the accuracy of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.
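The abstract reports sensitivity and specificity for exact gene structures without defining them. The sketch below assumes the convention common in gene-finder evaluation: sensitivity is the fraction of annotated genes predicted exactly, and specificity is the fraction of predicted genes that are exactly correct. The `Gene` tuple layout, the function name `exact_gene_accuracy`, and the toy coordinates are illustrative assumptions, not taken from the paper or from GlimmerHMM.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (sequence id, strand, ordered coding exons as (start, end)).
Gene = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def exact_gene_accuracy(
    predicted: Iterable[Gene], annotated: Iterable[Gene]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exact-gene level.

    A prediction counts as correct only if every exon boundary matches
    an annotated gene on the same sequence and strand.
    """
    pred: Set[Gene] = set(predicted)
    true: Set[Gene] = set(annotated)
    exact = len(pred & true)
    sensitivity = exact / len(true) if true else 0.0
    specificity = exact / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
annotated = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((900, 1000),)),
    ("chr2", "+", ((50, 150), (250, 350), (450, 500))),
    ("chr2", "-", ((700, 800),)),
]
predicted = [
    ("chr1", "+", ((100, 200), (300, 400))),             # exact match
    ("chr1", "-", ((900, 990),)),                        # one boundary off: not exact
    ("chr2", "+", ((50, 150), (250, 350), (450, 500))),  # exact match
]

sn, sp = exact_gene_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Under this definition a single misplaced exon boundary makes the whole gene count as wrong, which is why exact-gene scores are much stricter than exon-level or nucleotide-level scores.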

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 208-215
Number of pages: 8
Volume: 4645 LNBI
State: Published - 2007
Externally published: Yes
Event: 7th International Workshop on Algorithms in Bioinformatics, WABI 2007 - Philadelphia, PA, United States
Duration: Sep 8 2007 → Sep 9 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4645 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 7th International Workshop on Algorithms in Bioinformatics, WABI 2007
Country: United States
City: Philadelphia, PA
Period: 9/8/07 → 9/9/07

Keywords

  • ab initio gene finding
  • GHMM
  • Pfam
  • Profile HMM
  • Protein domain

ASJC Scopus subject areas

  • Computer Science(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Theoretical Computer Science

Cite this

Pertea, M., & Salzberg, S. L. (2007). Using protein domains to improve the accuracy of Ab Initio gene finding. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4645 LNBI, pp. 208-215).

@inproceedings{a6c96a4ab1014a09a68bdf4b3ba8174c,
title = "Using protein domains to improve the accuracy of Ab Initio gene finding",
abstract = "Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using similarity to known protein domains to help identify genes in a DNA sequence. We have adapted the generalized hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM so that a higher probability is assigned to exons that contain homologues of protein domains. To our knowledge, this domain-homology-based approach has not previously been used in the context of ab initio gene prediction. Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures similar to Pfam models. The augmented system, GlimmerHMM+, shows a 2{\%} improvement in sensitivity and a 1{\%} increase in specificity in predicting exact gene structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms, Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish), and together these preliminary results demonstrate the value of using protein domain homology in gene prediction. The results are encouraging, and we believe that a more comprehensive approach, including a model that reflects the statistical characteristics of specific sets of protein domain families, would yield a greater increase in the accuracy of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.",
keywords = "ab initio gene finding, GHMM, Pfam, Profile HMM, Protein domain",
author = "Mihaela Pertea and Salzberg, {Steven L}",
year = "2007",
language = "English (US)",
isbn = "9783540741251",
volume = "4645 LNBI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "208--215",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - Using protein domains to improve the accuracy of Ab Initio gene finding

AU - Pertea, Mihaela

AU - Salzberg, Steven L

PY - 2007

Y1 - 2007

N2 - Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using similarity to known protein domains to help identify genes in a DNA sequence. We have adapted the generalized hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM so that a higher probability is assigned to exons that contain homologues of protein domains. To our knowledge, this domain-homology-based approach has not previously been used in the context of ab initio gene prediction. Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures similar to Pfam models. The augmented system, GlimmerHMM+, shows a 2% improvement in sensitivity and a 1% increase in specificity in predicting exact gene structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms, Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish), and together these preliminary results demonstrate the value of using protein domain homology in gene prediction. The results are encouraging, and we believe that a more comprehensive approach, including a model that reflects the statistical characteristics of specific sets of protein domain families, would yield a greater increase in the accuracy of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.

KW - ab initio gene finding

KW - GHMM

KW - Pfam

KW - Profile HMM

KW - Protein domain

UR - http://www.scopus.com/inward/record.url?scp=37249023663&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=37249023663&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:37249023663

SN - 9783540741251

VL - 4645 LNBI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 208

EP - 215

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -