Terminating contamination: Large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Martin Steinegger, Steven L. Salzberg

Research output: Contribution to journalArticlepeer-review

Abstract

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator

Original languageEnglish (US)
JournalUnknown Journal
DOIs
StatePublished - Jan 26 2020

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)
  • Immunology and Microbiology(all)
  • Neuroscience(all)
  • Pharmacology, Toxicology and Pharmaceutics(all)

Fingerprint Dive into the research topics of 'Terminating contamination: Large-scale search identifies more than 2,000,000 contaminated entries in GenBank'. Together they form a unique fingerprint.

Cite this