Estimation of sequencing error rates in short reads

Xin Victoria Wang, Natalie Blades, Jie Ding, Razvan Sultana, Giovanni Parmigiani

Research output: Contribution to journal (Article)

Abstract

Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors, and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments.

Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the differences between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html.

Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
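The core idea of the abstract, regressing the number of near-identical erroneous reads on the copy number of the read they resemble, can be illustrated with a minimal Python sketch. This is not the authors' R package. The function names, the choice of a least-squares fit through the origin, the use of only the most abundant reads, and the transformation slope / (1 + slope) to a per-read error rate are all assumptions made for illustration; the abstract only states that the slope "can be transformed" into an error rate. The per-base conversion additionally assumes independent, equally likely errors at every position.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def shadow_counts(reads, n_top=100, max_dist=2):
    """For each of the n_top most abundant reads, return (copy number,
    number of reads that differ from it by 1..max_dist bases).

    Assumption: abundant reads are treated as error-free sequences and
    their close neighbours as erroneous copies of them.
    """
    counts = Counter(reads)
    pairs = []
    for seq, copies in counts.most_common(n_top):
        shadows = sum(
            c
            for other, c in counts.items()
            if len(other) == len(seq) and 1 <= hamming(seq, other) <= max_dist
        )
        pairs.append((copies, shadows))
    return pairs


def slope_through_origin(pairs):
    """Least-squares slope of shadows on copy number, with no intercept
    (an assumed modelling choice)."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx


def error_rates(slope, read_length):
    """Assumed transformation: if a sequence yields n correct reads and
    slope * n erroneous ones, the erroneous fraction is slope / (1 + slope).
    The per-base rate assumes independent, uniform errors along the read."""
    per_read = slope / (1.0 + slope)
    per_base = 1.0 - (1.0 - per_read) ** (1.0 / read_length)
    return per_read, per_base


if __name__ == "__main__":
    # Toy data: two true sequences plus a few one- and two-base variants.
    reads = (
        ["ACGTACGT"] * 100 + ["ACGTACGA"] * 3 + ["ACGTACCA"] * 2
        + ["TTTTGGGG"] * 200 + ["TTTTGGGA"] * 6 + ["ATTTGGGC"] * 4
    )
    pairs = shadow_counts(reads, n_top=2)
    slope = slope_through_origin(pairs)
    print(pairs, slope, error_rates(slope, read_length=8))
```

The all-against-all Hamming comparison keeps the sketch short but scales poorly; a practical implementation would index reads or enumerate neighbours directly. The sketch also ignores the case where two genuinely distinct sequences lie within two bases of each other, which would inflate the shadow counts.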

Original language: English (US)
Article number: 185
Journal: BMC Bioinformatics
Volume: 13
Issue number: 1
DOIs: https://doi.org/10.1186/1471-2105-13-185
State: Published - Jul 30 2012
Externally published: Yes

Fingerprint

Sequencing
Error Rate
Genes
Genome
Calibration
Pipelines
Experiments
Estimate
Fidelity
Experiment
Technology
Slope
Count
Monitor
Mutation
Simulation Study
Monitoring
Research
Methodology
Alternatives

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
  • Structural Biology

Cite this

Wang, X. V., Blades, N., Ding, J., Sultana, R., & Parmigiani, G. (2012). Estimation of sequencing error rates in short reads. BMC Bioinformatics, 13(1), Article 185. https://doi.org/10.1186/1471-2105-13-185

@article{dab898999d03412b933202b9b04ea34d,
title = "Estimation of sequencing error rates in short reads",
author = "Xin Victoria Wang and Natalie Blades and Jie Ding and Razvan Sultana and Giovanni Parmigiani",
year = "2012",
month = "7",
day = "30",
doi = "10.1186/1471-2105-13-185",
language = "English (US)",
volume = "13",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",
pages = "185",
}

TY - JOUR

T1 - Estimation of sequencing error rates in short reads

AU - Wang, Xin Victoria

AU - Blades, Natalie

AU - Ding, Jie

AU - Sultana, Razvan

AU - Parmigiani, Giovanni

PY - 2012/7/30

Y1 - 2012/7/30

UR - http://www.scopus.com/inward/record.url?scp=84869089137&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84869089137&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-13-185

DO - 10.1186/1471-2105-13-185

M3 - Article

C2 - 22846331

AN - SCOPUS:84879446930

VL - 13

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 185

ER -