Model-based quality assessment and base-calling for second-generation sequencing data

Héctor Corrada Bravo; Rafael A. Irizarry

doi:10.1111/j.1541-0420.2009.01353.x

Research output: Contribution to journal › Article › peer-review

41 Scopus citations

Abstract

Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of base-calling, the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads are of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.

Original language	English (US)
Pages (from-to)	665-674
Number of pages	10
Journal	Biometrics
Volume	66
Issue number	3
DOIs	https://doi.org/10.1111/j.1541-0420.2009.01353.x
State	Published - Sep 2010
Externally published	Yes

Keywords

Base-calling
Large-scale data analysis
Linear models
Quality assessment
Second-generation DNA sequencing

ASJC Scopus subject areas

Statistics and Probability
General Biochemistry, Genetics and Molecular Biology
General Immunology and Microbiology
General Agricultural and Biological Sciences
Applied Mathematics

Cite this

@article{4d81393a112f4aa690f81eb47e98b920,

title = "Model-based quality assessment and base-calling for second-generation sequencing data",

abstract = "Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of base-calling, the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads are of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.",

keywords = "Base-calling, Large-scale data analysis, Linear models, Quality assessment, Second-generation DNA sequencing",

author = "Bravo, {H{\'e}ctor Corrada} and Irizarry, {Rafael A.}",

year = "2010",

month = sep,

doi = "10.1111/j.1541-0420.2009.01353.x",

language = "English (US)",

volume = "66",

pages = "665--674",

journal = "Biometrics",

issn = "0006-341X",

publisher = "Wiley-Blackwell",

number = "3",

}

TY - JOUR

T1 - Model-based quality assessment and base-calling for second-generation sequencing data

AU - Bravo, Héctor Corrada

AU - Irizarry, Rafael A.

PY - 2010/9

Y1 - 2010/9

AB - Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of base-calling, the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads are of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.

KW - Base-calling

KW - Large-scale data analysis

KW - Linear models

KW - Quality assessment

KW - Second-generation DNA sequencing

UR - http://www.scopus.com/inward/record.url?scp=77956837921&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77956837921&partnerID=8YFLogxK

U2 - 10.1111/j.1541-0420.2009.01353.x

DO - 10.1111/j.1541-0420.2009.01353.x

M3 - Article

C2 - 19912177

AN - SCOPUS:77956837921

SN - 0006-341X

VL - 66

SP - 665

EP - 674

JO - Biometrics

JF - Biometrics

IS - 3

ER -
