Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering

David R. Kelley, Bo Liu, Arthur L. Delcher, Mihai Pop, Steven L. Salzberg

Research output: Contribution to journal › Article

Abstract

Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.
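
To make the point about sequencing errors concrete, the toy sketch below shows how a single-base deletion moves an open reading frame into a different frame and introduces a premature stop codon in the original one. It is not Glimmer-MG's algorithm or code: the read, the helper names (`first_stop`, `stop_free_frames`, `delete_base`) and the forward-strand-only scan are assumptions made for illustration.

```python
# Toy illustration only; all names and the example read are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(seq, frame):
    """Return the 0-based index of the first stop codon in this frame, or None."""
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def stop_free_frames(seq):
    """Return the forward-strand frames (0, 1, 2) that contain no stop codon."""
    return [f for f in range(3) if first_stop(seq, f) is None]


def delete_base(seq, pos):
    """Simulate a single-base deletion sequencing error at position pos."""
    return seq[:pos] + seq[pos + 1:]


# A made-up read whose frame 0 is free of stop codons.
read = "ATGGCTGACCTGAAAGCTTGGCGT"

# Drop one base near the start, as an indel sequencing error might.
mutated = delete_base(read, 4)

print("stop-free frames before error:", stop_free_frames(read))
print("stop-free frames after error: ", stop_free_frames(mutated))
print("first frame-0 stop after error:", first_stop(mutated, 0))
```

In this toy case a gene finder that cannot predict the error would have to truncate the frame-0 gene at the new stop codon, whereas a model that allows a frame change at the error position could follow the coding signal into the shifted frame.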

Original language: English (US)
Journal: Nucleic Acids Research
Volume: 40
Issue number: 1
DOI: 10.1093/nar/gkr1067
State: Published - Jan 2012

Fingerprint

Metagenomics
Metagenome
Cluster Analysis
Genes
Terminator Codon
Firearms
Human Body
Ecosystem
Software
Genome
Proteins

ASJC Scopus subject areas

  • Genetics

Cite this

Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. / Kelley, David R.; Liu, Bo; Delcher, Arthur L.; Pop, Mihai; Salzberg, Steven L.

In: Nucleic Acids Research, Vol. 40, No. 1, 01.2012.

Kelley, David R. ; Liu, Bo ; Delcher, Arthur L. ; Pop, Mihai ; Salzberg, Steven L. / Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. In: Nucleic Acids Research. 2012 ; Vol. 40, No. 1.
@article{04abfd7003f94006b1fc4d05093e1061,
title = "Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering",
abstract = "Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.",
author = "Kelley, {David R.} and Bo Liu and Delcher, {Arthur L.} and Mihai Pop and Salzberg, {Steven L.}",
year = "2012",
month = "1",
doi = "10.1093/nar/gkr1067",
language = "English (US)",
volume = "40",
journal = "Nucleic Acids Research",
issn = "1362-4962",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering

AU - Kelley, David R.

AU - Liu, Bo

AU - Delcher, Arthur L.

AU - Pop, Mihai

AU - Salzberg, Steven L.

PY - 2012/1

Y1 - 2012/1

N2 - Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.

AB - Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.

UR - http://www.scopus.com/inward/record.url?scp=84855258501&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84855258501&partnerID=8YFLogxK

U2 - 10.1093/nar/gkr1067

DO - 10.1093/nar/gkr1067

M3 - Article

C2 - 22102569

AN - SCOPUS:84855258501

VL - 40

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 1362-4962

IS - 1

ER -