Missing data and technical variability in single-cell RNA-sequencing experiments

Stephanie Hicks, F. William Townes, Mingxiang Teng, Rafael A. Irizarry

Research output: Contribution to journal (Article)

Abstract

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used, and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
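
The per-cell proportion of zeros that the abstract refers to can be illustrated with a minimal sketch. The function name `zero_fraction_per_cell` and the toy count matrix are assumptions made purely for illustration; they are not taken from the paper, and the numbers are invented.

```python
from typing import List, Sequence


def zero_fraction_per_cell(counts: Sequence[Sequence[int]]) -> List[float]:
    """Return, for each cell, the fraction of genes whose count is zero.

    `counts` is a genes-by-cells matrix: rows are genes, columns are cells.
    (Hypothetical helper for illustration, not the authors' code.)
    """
    n_genes = len(counts)
    n_cells = len(counts[0])
    return [
        sum(1 for g in range(n_genes) if counts[g][c] == 0) / n_genes
        for c in range(n_cells)
    ]


# Toy, made-up counts: 5 genes (rows) x 3 cells (columns).
toy_counts = [
    [0, 3, 0],
    [5, 0, 0],
    [0, 0, 0],
    [2, 7, 1],
    [0, 4, 0],
]

zero_fractions = zero_fraction_per_cell(toy_counts)
print(zero_fractions)
```

In this toy matrix the three cells have zero fractions of 0.6, 0.4, and 0.8. The computation alone cannot say whether such spread is technical or biological, which is the question the paper examines.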

Original language: English (US)
Pages (from-to): 562-578
Number of pages: 17
Journal: Biostatistics
Volume: 19
Issue number: 4
DOI: 10.1093/biostatistics/kxx053
State: Published - Oct 1 2018
Externally published: Yes

Fingerprint

Missing Data
Sequencing
Cell
Experiment
Gene
Systematic Error
Zero
High Throughput
Batch
Gene Expression
Vary
Throughput
Demonstrate
Percentage
Genome
Proportion

Keywords

  • Censoring
  • Confounding
  • Genomics
  • Missing not at random (MNAR)
  • Single-cell RNA-Sequencing

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Missing data and technical variability in single-cell RNA-sequencing experiments. / Hicks, Stephanie; Townes, F. William; Teng, Mingxiang; Irizarry, Rafael A.

In: Biostatistics, Vol. 19, No. 4, 01.10.2018, p. 562-578.

Hicks, Stephanie ; Townes, F. William ; Teng, Mingxiang ; Irizarry, Rafael A. / Missing data and technical variability in single-cell RNA-sequencing experiments. In: Biostatistics. 2018 ; Vol. 19, No. 4. pp. 562-578.
@article{963f20f39f33422f834e0ccddafc20d1,
title = "Missing data and technical variability in single-cell RNA-sequencing experiments",
abstract = "Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used, and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.",
keywords = "Censoring, Confounding, Genomics, Missing not at random (MNAR), Single-cell RNA-Sequencing",
author = "Stephanie Hicks and Townes, {F. William} and Mingxiang Teng and Irizarry, {Rafael A.}",
year = "2018",
month = "10",
day = "1",
doi = "10.1093/biostatistics/kxx053",
language = "English (US)",
volume = "19",
pages = "562--578",
journal = "Biostatistics",
issn = "1465-4644",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Missing data and technical variability in single-cell RNA-sequencing experiments

AU - Hicks, Stephanie

AU - Townes, F. William

AU - Teng, Mingxiang

AU - Irizarry, Rafael A.

PY - 2018/10/1

Y1 - 2018/10/1

N2 - Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used, and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.

KW - Censoring

KW - Confounding

KW - Genomics

KW - Missing not at random (MNAR)

KW - Single-cell RNA-Sequencing

UR - http://www.scopus.com/inward/record.url?scp=85054726691&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054726691&partnerID=8YFLogxK

U2 - 10.1093/biostatistics/kxx053

DO - 10.1093/biostatistics/kxx053

M3 - Article

C2 - 29121214

AN - SCOPUS:85054726691

VL - 19

SP - 562

EP - 578

JO - Biostatistics

JF - Biostatistics

SN - 1465-4644

IS - 4

ER -