Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models

Arthur Brady, Steven L Salzberg

Research output: Contribution to journal › Article

Abstract

Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.

Original language: English (US)
Pages (from-to): 673-676
Number of pages: 4
Journal: Nature Methods
Volume: 6
Issue number: 9
DOI: 10.1038/nmeth.1358
State: Published - 2009
Externally published: Yes

Fingerprint

Metagenomics
Genes
Technology
Base Pairing
Biodiversity
Genome
Sequence Alignment
Classifiers
DNA
Research Personnel
Chemical analysis

ASJC Scopus subject areas

  • Biotechnology
  • Molecular Biology
  • Biochemistry
  • Cell Biology

Cite this

Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models. / Brady, Arthur; Salzberg, Steven L.

In: Nature Methods, Vol. 6, No. 9, 2009, p. 673-676.

@article{5d395d5aae3f4a5487d0f8583bea107b,
title = "Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models",
abstract = "Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.",
author = "Brady, Arthur and Salzberg, Steven L.",
year = "2009",
doi = "10.1038/nmeth.1358",
language = "English (US)",
volume = "6",
pages = "673--676",
journal = "Nature Methods",
issn = "1548-7091",
publisher = "Nature Publishing Group",
number = "9",
}

TY - JOUR

T1 - Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models

AU - Brady, Arthur

AU - Salzberg, Steven L

PY - 2009

Y1 - 2009

N2 - Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.

AB - Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.

UR - http://www.scopus.com/inward/record.url?scp=69549135124&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=69549135124&partnerID=8YFLogxK

U2 - 10.1038/nmeth.1358

DO - 10.1038/nmeth.1358

M3 - Article

C2 - 19648916

AN - SCOPUS:69549135124

VL - 6

SP - 673

EP - 676

JO - Nature Methods

JF - Nature Methods

SN - 1548-7091

IS - 9

ER -