Short Read Mapping: An Algorithmic Tour

Stefan Canzar, Steven L Salzberg

Research output: Contribution to journal, Article

Abstract

Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads,' that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair-long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
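The requirement that mapping tolerate mismatches, insertions, and deletions can be illustrated with a minimal sketch of classical dynamic-programming approximate matching. This is not taken from the article: the function name `best_mapping`, the unit edit costs, and the toy sequences are assumptions chosen for illustration. Its running time grows with the product of read length and reference length, which hints at why this kind of optimal method is impractical for hundreds of millions of reads against a 3-billion-base-pair genome.

```python
from typing import Tuple


def best_mapping(read: str, reference: str) -> Tuple[int, int]:
    """Return (edit_distance, end) for the best approximate occurrence of
    `read` in `reference`, where `end` is the exclusive end index of the
    matched reference substring. Unit costs are an illustrative assumption.
    """
    # prev[j]: fewest edits aligning the read prefix processed so far to
    # some reference substring ending at position j. The empty prefix
    # matches anywhere for free, so the read may start at any position.
    prev = [0] * (len(reference) + 1)
    for i, r in enumerate(read, start=1):
        # Aligning i read bases against an empty substring costs i edits.
        curr = [i] + [0] * len(reference)
        for j, g in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (r != g),  # match (cost 0) or mismatch (cost 1)
                prev[j] + 1,             # read base missing from the reference
                curr[j - 1] + 1,         # reference base missing from the read
            )
        prev = curr
    # The read may end anywhere: take the cheapest cell of the last row.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


if __name__ == "__main__":
    reference = "ACGTTGCATGCA"
    print(best_mapping("TTGCA", reference))  # exact occurrence: (0, 8)
    print(best_mapping("TTGGA", reference)[0])  # one mismatch: 1
    print(best_mapping("TTCA", reference)[0])   # one missing base: 1
```

Only the previous row of the table is kept, so memory stays proportional to the reference length; recovering the actual alignment would require storing the full table or traceback pointers.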

Original language: English (US)
Article number: 7244195
Pages (from-to): 436-458
Number of pages: 23
Journal: Proceedings of the IEEE
Volume: 105
Issue number: 3
DOI: 10.1109/JPROC.2015.2455551
State: Published - Mar 1 2017

Fingerprint

Genes
DNA
DNA sequences
Nucleotides
Gene expression
Throughput
Proteins
Molecules
Costs

Keywords

  • Burrows-Wheeler transform
  • DNA sequencing
  • sequence alignment
  • string matching
  • suffix trees

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

Short Read Mapping: An Algorithmic Tour. / Canzar, Stefan; Salzberg, Steven L.

In: Proceedings of the IEEE, Vol. 105, No. 3, 7244195, 01.03.2017, p. 436-458.

Canzar, Stefan; Salzberg, Steven L. / Short Read Mapping: An Algorithmic Tour. In: Proceedings of the IEEE. 2017; Vol. 105, No. 3. pp. 436-458.

@article{3af29538c4c446ac9d2914c1a69dcc32,
title = "Short Read Mapping: An Algorithmic Tour",
abstract = "Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads,' that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair-long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.",
keywords = "Burrows-Wheeler transform, DNA sequencing, sequence alignment, string matching, suffix trees",
author = "Canzar, Stefan and Salzberg, {Steven L.}",
year = "2017",
month = "3",
day = "1",
doi = "10.1109/JPROC.2015.2455551",
language = "English (US)",
volume = "105",
pages = "436--458",
journal = "Proceedings of the IEEE",
issn = "0018-9219",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "3",
}

TY - JOUR
T1 - Short Read Mapping
T2 - An Algorithmic Tour
AU - Canzar, Stefan
AU - Salzberg, Steven L
PY - 2017/3/1
Y1 - 2017/3/1
AB - Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads,' that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair-long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.
KW - Burrows-Wheeler transform
KW - DNA sequencing
KW - sequence alignment
KW - string matching
KW - suffix trees
UR - http://www.scopus.com/inward/record.url?scp=84941243193&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84941243193&partnerID=8YFLogxK
U2 - 10.1109/JPROC.2015.2455551
DO - 10.1109/JPROC.2015.2455551
M3 - Article
C2 - 28502990
AN - SCOPUS:84941243193
VL - 105
SP - 436
EP - 458
JO - Proceedings of the IEEE
JF - Proceedings of the IEEE
SN - 0018-9219
IS - 3
M1 - 7244195
ER -