Clustering metagenomic sequences with interpolated Markov models

David R. Kelley; Steven L. Salzberg

doi:10.1186/1471-2105-11-544

Research output: Contribution to journal › Article › peer-review

70 Scopus citations

Abstract

Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects.

Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available.

Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.

Original language	English (US)
Article number	544
Journal	BMC Bioinformatics
Volume	11
DOIs	https://doi.org/10.1186/1471-2105-11-544
State	Published - Nov 2 2010
Externally published	Yes

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Access to Document

10.1186/1471-2105-11-544

Cite this

@article{014a7bd6098649b19dc17624578ab8f2,

title = "Clustering metagenomic sequences with interpolated Markov models",

abstract = "Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available. Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.",

author = "Kelley, {David R.} and Salzberg, {Steven L.}",

note = "Funding Information: We thank Arthur Brady, Art Delcher, Mihai Pop, Carl Kingsford, Saket Navlakha, James White, and Adam Phillippy for valuable discussions on the method and the manuscript. This work was supported in part by the National Institutes of Health grants R01-LM006845 and R01-GM083873 to SLS.",

year = "2010",

month = nov,

day = "2",

doi = "10.1186/1471-2105-11-544",

language = "English (US)",

volume = "11",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Clustering metagenomic sequences with interpolated Markov models

AU - Kelley, David R.

AU - Salzberg, Steven L.

N1 - Funding Information: We thank Arthur Brady, Art Delcher, Mihai Pop, Carl Kingsford, Saket Navlakha, James White, and Adam Phillippy for valuable discussions on the method and the manuscript. This work was supported in part by the National Institutes of Health grants R01-LM006845 and R01-GM083873 to SLS.

PY - 2010/11/2

Y1 - 2010/11/2

N2 - Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available. Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.

AB - Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available. Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.

UR - http://www.scopus.com/inward/record.url?scp=77958605377&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77958605377&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-11-544

DO - 10.1186/1471-2105-11-544

M3 - Article

C2 - 21044341

AN - SCOPUS:77958605377

SN - 1471-2105

VL - 11

JO - BMC Bioinformatics

JF - BMC Bioinformatics

M1 - 544

ER -
