The Terabase Search Engine: A large-scale relational database of short-read sequences

Richard Wilton; Sarah J. Wheelan; Alexander S. Szalay; Steven L. Salzberg

doi:10.1093/bioinformatics/bty657

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples. Results: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.

Original language	English (US)
Pages (from-to)	665-670
Number of pages	6
Journal	Bioinformatics
Volume	35
Issue number	4
DOIs	https://doi.org/10.1093/bioinformatics/bty657
State	Published - Feb 15 2019

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Access to Document

10.1093/bioinformatics/bty657

Cite this

@article{7770a3cdfe37487d9195db31cbdeb161,

title = "The Terabase Search Engine: A large-scale relational database of short-read sequences",

abstract = "DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across samples from multiple whole-genome sequencing samples. Results: We have built a compact, efficiently-indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real-time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.",

author = "Wilton, Richard and Wheelan, {Sarah J.} and Szalay, {Alexander S.} and Salzberg, {Steven L.}",

year = "2019",

month = feb,

day = "15",

doi = "10.1093/bioinformatics/bty657",

language = "English (US)",

volume = "35",

pages = "665--670",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "4",

}

TY - JOUR

T1 - The Terabase Search Engine

T2 - A large-scale relational database of short-read sequences

AU - Wilton, Richard

AU - Wheelan, Sarah J.

AU - Szalay, Alexander S.

AU - Salzberg, Steven L.

PY - 2019/2/15

Y1 - 2019/2/15

N2 - DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across samples from multiple whole-genome sequencing samples. Results: We have built a compact, efficiently-indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real-time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.

AB - DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across samples from multiple whole-genome sequencing samples. Results: We have built a compact, efficiently-indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real-time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.

UR - http://www.scopus.com/inward/record.url?scp=85062056868&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062056868&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bty657

DO - 10.1093/bioinformatics/bty657

M3 - Article

C2 - 30052772

AN - SCOPUS:85062056868

SN - 1367-4803

VL - 35

SP - 665

EP - 670

JO - Bioinformatics

JF - Bioinformatics

IS - 4

ER -
