Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Michael I. Love, John B. Hogenesch, Rafael A. Irizarry

Research output: Contribution to journal (Article)

Abstract

We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
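The quantity at the center of the abstract, fragment GC content, is the fraction of G and C bases in the stretch of transcript that a sequenced fragment covers. The following is a minimal Python sketch of that calculation only. The function names `gc_content` and `fragment_gc` and the 0-based, end-exclusive coordinates are assumptions made for illustration; this is not the alpine package's API and it does not reproduce the bias model described in the paper.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (case-insensitive); 0.0 for empty input."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def fragment_gc(transcript: str, start: int, end: int) -> float:
    """GC content of the fragment covering transcript[start:end].

    Coordinates are hypothetical: 0-based and end-exclusive.
    """
    if not 0 <= start < end <= len(transcript):
        raise ValueError("fragment coordinates fall outside the transcript")
    return gc_content(transcript[start:end])
```

For instance, `fragment_gc("AAAAGGGG", 2, 6)` covers the fragment `AAGG`. Per-fragment values like this are the kind of sequence feature that a sample-specific bias model could take as input.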

Original language: English (US)
Pages (from-to): 1287-1291
Number of pages: 5
Journal: Nature Biotechnology
Volume: 34
Issue number: 12
DOIs: 10.1038/nbt.3682
State: Published - Dec 1 2016
Externally published: Yes

Fingerprint

Data visualization
Systematic errors
Computational methods
RNA
Protein Isoforms
Base Composition

ASJC Scopus subject areas

  • Biotechnology
  • Bioengineering
  • Applied Microbiology and Biotechnology
  • Biomedical Engineering
  • Molecular Medicine

Cite this

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. / Love, Michael I.; Hogenesch, John B.; Irizarry, Rafael A.

In: Nature Biotechnology, Vol. 34, No. 12, 01.12.2016, p. 1287-1291.

Love, Michael I. ; Hogenesch, John B. ; Irizarry, Rafael A. / Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. In: Nature Biotechnology. 2016 ; Vol. 34, No. 12. pp. 1287-1291.
@article{fe013d5edceb403f80bf36ff2a42912d,
title = "Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation",
abstract = "We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.",
author = "Love, {Michael I.} and Hogenesch, {John B.} and Irizarry, {Rafael A.}",
year = "2016",
month = "12",
day = "1",
doi = "10.1038/nbt.3682",
language = "English (US)",
volume = "34",
pages = "1287--1291",
journal = "Nature Biotechnology",
issn = "1087-0156",
publisher = "Nature Publishing Group",
number = "12",

}

TY - JOUR

T1 - Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

AU - Love, Michael I.

AU - Hogenesch, John B.

AU - Irizarry, Rafael A.

PY - 2016/12/1

Y1 - 2016/12/1

AB - We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.

UR - http://www.scopus.com/inward/record.url?scp=85003441754&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85003441754&partnerID=8YFLogxK

U2 - 10.1038/nbt.3682

DO - 10.1038/nbt.3682

M3 - Article

C2 - 27669167

AN - SCOPUS:85003441754

VL - 34

SP - 1287

EP - 1291

JO - Nature Biotechnology

JF - Nature Biotechnology

SN - 1087-0156

IS - 12

ER -