A preprocessor for shotgun assembly of large genomes

Michael Roberts, Brian R. Hunt, James A. Yorke, Randall A. Bolanos, Arthur L. Delcher

Research output: Contribution to journal › Article

Abstract

The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a "read." Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the "UMD Overlapper," can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera's Drosophila reads. When we replaced Celera's overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
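
To make the terms "quality value" and "overlap" in the abstract concrete, here is a minimal sketch in Python. It is not the UMD Overlapper's method, which the abstract does not describe in detail. It assumes quality values follow the common Phred convention (Q = -10 log10 P), which the abstract does not state, and it finds overlaps by naive exact suffix-prefix matching, which ignores sequencing errors and reverse-complement reads and compares every pair of reads. The function names and the toy reads are hypothetical.

```python
from itertools import permutations


def error_probability(quality: int) -> float:
    """Convert a quality value to an error probability.

    Assumes the Phred convention Q = -10 * log10(P); this is an
    assumption, not something stated in the abstract.
    """
    return 10 ** (-quality / 10)


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no exact match of at least `min_len` letters exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def find_overlaps(reads: dict[str, str], min_len: int = 4) -> list[tuple[str, str, int]]:
    """List (read_a, read_b, overlap_length) for every ordered pair that overlaps.

    Naive all-pairs comparison: fine for a toy example, far too slow
    for the millions of reads in a real shotgun project.
    """
    overlaps = []
    for (name_a, seq_a), (name_b, seq_b) in permutations(reads.items(), 2):
        length = suffix_prefix_overlap(seq_a, seq_b, min_len)
        if length:
            overlaps.append((name_a, name_b, length))
    return overlaps


# Hypothetical toy reads; real reads are about 500 letters long.
reads = {
    "r1": "ACGTTGCA",
    "r2": "TGCAGGTC",
    "r3": "GGTCAAAT",
}
overlaps = find_overlaps(reads)
```

Here `overlaps` holds the pairs (r1, r2) and (r2, r3), each sharing four letters. A practical overlapper has to tolerate mismatches, which is one reason error correction and overlap detection are natural to combine in a single preprocessor.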

Original language: English (US)
Pages (from-to): 734-752
Number of pages: 19
Journal: Journal of Computational Biology
Volume: 11
Issue number: 4
DOIs: https://doi.org/10.1089/cmb.2004.11.734
State: Published - 2004
Externally published: Yes

Fingerprint

Genome
Genes
Sequencing
Overlap
Fragment
Drosophilidae
Drosophila
Overlapping
Estimate

Keywords

  • Disk-based sorting
  • DNA fragment overlap determination
  • Sequencing error correction
  • Whole genome shotgun assembly

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics

Cite this

Roberts, M., Hunt, B. R., Yorke, J. A., Bolanos, R. A., & Delcher, A. L. (2004). A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4), 734-752. https://doi.org/10.1089/cmb.2004.11.734

@article{8a1e0bdeb5944abc9de2f9a402c807ad,
title = "A preprocessor for shotgun assembly of large genomes",
keywords = "Disk-based sorting, DNA fragment overlap determination, Sequencing error correction, Whole genome shotgun assembly",
author = "Roberts, Michael and Hunt, {Brian R.} and Yorke, {James A.} and Bolanos, {Randall A.} and Delcher, {Arthur L.}",
year = "2004",
doi = "10.1089/cmb.2004.11.734",
language = "English (US)",
volume = "11",
pages = "734--752",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "4",

}

TY - JOUR

T1 - A preprocessor for shotgun assembly of large genomes

AU - Roberts, Michael

AU - Hunt, Brian R.

AU - Yorke, James A.

AU - Bolanos, Randall A.

AU - Delcher, Arthur L.

PY - 2004

Y1 - 2004

KW - Disk-based sorting

KW - DNA fragment overlap determination

KW - Sequencing error correction

KW - Whole genome shotgun assembly

UR - http://www.scopus.com/inward/record.url?scp=4544329299&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4544329299&partnerID=8YFLogxK

U2 - 10.1089/cmb.2004.11.734

DO - 10.1089/cmb.2004.11.734

M3 - Article

C2 - 15579242

AN - SCOPUS:4544329299

VL - 11

SP - 734

EP - 752

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 4

ER -