A preprocessor for shotgun assembly of large genomes

Michael Roberts, Brian R. Hunt, James A. Yorke, Randall A. Bolanos, Arthur L. Delcher

Research output: Contribution to journal › Article › peer-review

26 Scopus citations


The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a "read." Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the "UMD Overlapper," can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera's Drosophila reads. When we replaced Celera's overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
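Two ideas in the abstract can be made concrete with a small sketch: a quality value that estimates a per-letter error probability, and a list of "overlaps" between reads. The code below is illustrative only and is not the UMD Overlapper's method. It assumes the common Phred convention for quality values (the abstract does not name a scale) and uses a generic shared k-mer index to propose candidate overlapping pairs. The function names, the toy reads, and the choice k=4 are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def phred_to_error_prob(q):
    """Assumed Phred convention: P(error) = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def candidate_overlaps(reads, k=4):
    """Return sorted pairs of read indices that share at least one k-mer.

    This is a generic illustration of proposing overlap candidates,
    not the procedure described in the paper.
    """
    # Map each k-mer to the set of reads that contain it.
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]].add(i)

    # Any two reads listed under the same k-mer are a candidate pair.
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Toy reads (hypothetical): the first two share the substring "TTGCA".
reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
overlaps = candidate_overlaps(reads, k=4)
```

Under these assumptions, a quality value of 20 corresponds to an estimated error probability of about 1 in 100, and only the first two toy reads are reported as a candidate pair. A real overlapper would also have to handle reverse complements, sequencing errors within the shared region, and repeats, none of which this sketch attempts.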

Original language: English (US)
Pages (from-to): 734-752
Number of pages: 19
Journal: Journal of Computational Biology
Issue number: 4
State: Published - 2004


Keywords

  • DNA fragment overlap determination
  • Disk-based sorting
  • Sequencing error correction
  • Whole genome shotgun assembly

ASJC Scopus subject areas

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

