TopHat: Discovering splice junctions with RNA-Seq

Cole Trapnell; Lior Pachter; Steven L. Salzberg

doi:10.1093/bioinformatics/btp120

Research output: Contribution to journal › Article › peer-review

8091 Scopus citations

Abstract

Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.

Original language	English (US)
Pages (from-to)	1105-1111
Number of pages	7
Journal	Bioinformatics
Volume	25
Issue number	9
DOIs	https://doi.org/10.1093/bioinformatics/btp120
State	Published - 2009
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{f00a3c6a1d014458be2dfcf35c998969,

title = "TopHat: Discovering splice junctions with RNA-Seq",

abstract = "Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.",

author = "Cole Trapnell and Lior Pachter and Salzberg, {Steven L.}",

note = "Funding Information: Funding: National Institutes of Health (R01-LM06845, R01-GM083873 to S.L.S.); National Science Foundation (CCF 0347992 to L.P.).",

year = "2009",

doi = "10.1093/bioinformatics/btp120",

language = "English (US)",

volume = "25",

pages = "1105--1111",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "9",

}

TY - JOUR

T1 - TopHat: Discovering splice junctions with RNA-Seq

AU - Trapnell, Cole

AU - Pachter, Lior

AU - Salzberg, Steven L.

N1 - Funding Information: Funding: National Institutes of Health (R01-LM06845, R01-GM083873 to S.L.S.); National Science Foundation (CCF 0347992 to L.P.).

PY - 2009

Y1 - 2009

AB - Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.

UR - http://www.scopus.com/inward/record.url?scp=65449136284&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=65449136284&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btp120

DO - 10.1093/bioinformatics/btp120

M3 - Article

C2 - 19289445

AN - SCOPUS:65449136284

SN - 1367-4803

VL - 25

SP - 1105

EP - 1111

JO - Bioinformatics

JF - Bioinformatics

IS - 9

ER -
