The Terabase Search Engine: A large-scale relational database of short-read sequences

Richard Wilton, Sarah J. Wheelan, Alexander S. Szalay, Steven L. Salzberg

Research output: Contribution to journal › Article › peer-review

Abstract

Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples. Results: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.

Original language: English (US)
Pages (from-to): 665-670
Number of pages: 6
Journal: Bioinformatics
Volume: 35
Issue number: 4
State: Published - Feb 15 2019

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics
