GAGE: A critical evaluation of genome assemblies and assembly algorithms

Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, James A. Yorke

Research output: Contribution to journal › Article

Abstract

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three over-arching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
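
The third conclusion, that correctness is not well correlated with contiguity statistics, can be illustrated with a small sketch. The abstract does not name a particular statistic, so using N50 here is an assumption, and the function name `n50` and the contig lengths are hypothetical, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly: five contigs, 280 bases in total.
careful = [100, 80, 50, 30, 20]

# Same bases, but the two longest contigs have been joined.
# If that join is wrong, the assembly is less correct yet looks more contiguous.
misjoined = [180, 50, 30, 20]

print(n50(careful))    # 80
print(n50(misjoined))  # 180
```

In this toy case a single erroneous join more than doubles N50, which is one way a contiguity statistic can improve while correctness gets worse.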

Original language: English (US)
Pages (from-to): 557-567
Number of pages: 11
Journal: Genome Research
Volume: 22
Issue number: 3
DOI: https://doi.org/10.1101/gr.131383.111
State: Published - Mar 2012

Fingerprint

Genome
Technology
Mammals
Costs and Cost Analysis

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., ... Yorke, J. A. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22(3), 557-567. https://doi.org/10.1101/gr.131383.111

@article{be54d9bafe174a889a45817815d688fa,
title = "GAGE: A critical evaluation of genome assemblies and assembly algorithms",
author = "Salzberg, {Steven L} and Phillippy, {Adam M.} and Aleksey Zimin and Daniela Puiu and Tanja Magoc and Sergey Koren and Treangen, {Todd J.} and Schatz, {Michael C.} and Delcher, {Arthur L.} and Michael Roberts and Guillaume Marçais and Mihai Pop and Yorke, {James A.}",
year = "2012",
month = "3",
doi = "10.1101/gr.131383.111",
language = "English (US)",
volume = "22",
pages = "557--567",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "3",

}

TY - JOUR

T1 - GAGE

T2 - A critical evaluation of genome assemblies and assembly algorithms

AU - Salzberg, Steven L

AU - Phillippy, Adam M.

AU - Zimin, Aleksey

AU - Puiu, Daniela

AU - Magoc, Tanja

AU - Koren, Sergey

AU - Treangen, Todd J.

AU - Schatz, Michael C.

AU - Delcher, Arthur L.

AU - Roberts, Michael

AU - Marçais, Guillaume

AU - Pop, Mihai

AU - Yorke, James A.

PY - 2012/3

Y1 - 2012/3

UR - http://www.scopus.com/inward/record.url?scp=84857893016&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84857893016&partnerID=8YFLogxK

U2 - 10.1101/gr.131383.111

DO - 10.1101/gr.131383.111

M3 - Article

C2 - 22147368

AN - SCOPUS:84857893016

VL - 22

SP - 557

EP - 567

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 3

ER -