Removing contaminants from databases of draft genomes

Jennifer Lu; Steven L. Salzberg

doi:10.1371/journal.pcbi.1006277

School of Medicine

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
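This record does not say how the authors' pipeline detects low-complexity sequence, so the following is only a minimal sketch of one generic approach: slide a fixed window along a sequence, compute the Shannon entropy of its base composition, and mask windows that fall below a cutoff. The function names, the window size of 20, and the 1.0-bit threshold are illustrative assumptions, not values from the paper, and a composition-only score like this will not catch every kind of repeat.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per base) of the base composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 20, threshold: float = 1.0) -> str:
    """Replace every base covered by a low-entropy window with 'N'.

    Hypothetical helper for illustration only; window and threshold
    are assumed values, not parameters taken from the paper.
    """
    seq = seq.upper()
    masked = list(seq)
    # Sequences shorter than one window are returned unmasked.
    for start in range(0, max(len(seq) - window + 1, 0)):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "N" * window
    return "".join(masked)
```

Masking with `N` rather than deleting bases keeps sequence coordinates stable, which can matter when a cleaned genome is later indexed by a classification tool.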

Original language	English (US)
Article number	e1006277
Journal	PLoS computational biology
Volume	14
Issue number	6
DOIs	https://doi.org/10.1371/journal.pcbi.1006277
State	Published - Jun 2018

ASJC Scopus subject areas

Ecology, Evolution, Behavior and Systematics
Modeling and Simulation
Ecology
Molecular Biology
Genetics
Cellular and Molecular Neuroscience
Computational Theory and Mathematics

Access to Document

10.1371/journal.pcbi.1006277

Cite this

@article{66a060552f4344449b89407a84708f91,

title = "Removing contaminants from databases of draft genomes",

abstract = "Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.",

author = "Lu, Jennifer and Salzberg, {Steven L.}",

note = "Publisher Copyright: {\textcopyright} 2018 Lu, Salzberg. http://creativecommons.org/licenses/by/4.0/",

year = "2018",

month = jun,

doi = "10.1371/journal.pcbi.1006277",

language = "English (US)",

volume = "14",

journal = "PLoS computational biology",

issn = "1553-734X",

publisher = "Public Library of Science",

number = "6",

}

TY - JOUR

T1 - Removing contaminants from databases of draft genomes

AU - Lu, Jennifer

AU - Salzberg, Steven L.

PY - 2018/6

Y1 - 2018/6

N2 - Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of “clean” eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.

UR - http://www.scopus.com/inward/record.url?scp=85049378897&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049378897&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1006277

DO - 10.1371/journal.pcbi.1006277

M3 - Article

C2 - 29939994

AN - SCOPUS:85049378897

SN - 1553-734X

VL - 14

JO - PLoS computational biology

JF - PLoS computational biology

IS - 6

M1 - e1006277

ER -
