Analysis and correction of compositional bias in sparse sequencing count data

M. Senthil Kumar, Eric V. Slud, Kwame Okrah, Stephanie Hicks, Sridhar Hannenhalli, Héctor Corrada Bravo

Research output: Contribution to journal › Article

Abstract

BACKGROUND: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches such as library size scaling, rarefaction, and subsampling cannot correct for compositional bias or for any other relevant technical bias that is uncorrelated with library size.

RESULTS: We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it.

CONCLUSIONS: Compositional bias, induced by the sequencing machine, confounds inferences of absolute abundances. We present a normalization technique for compositional bias correction in sparse sequencing count data, and demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of technical bias estimates arising from several publicly available large-scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed.
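The abstract's statement that library size scaling cannot correct compositional bias can be illustrated with a small numerical sketch. The example below is not the authors' empirical Bayes method; it is a toy construction with hypothetical abundances, idealized noise-free sequencing, and the assumption that one feature is known to be unchanged between samples (as a spike-in would be). It only shows how a change in a single feature shifts the apparent fold changes of all the others after scaling by library size.

```python
import numpy as np


def library_size_scale(counts):
    """Divide each sample (column) by its total count, giving proportions."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True)


# Hypothetical absolute abundances: 4 features (rows) x 2 samples (columns).
# Only feature 0 truly differs between the samples (10x higher in sample B).
absolute = np.array([
    [100.0, 1000.0],
    [200.0, 200.0],
    [300.0, 300.0],
    [400.0, 400.0],
])

# Idealized sequencing at a fixed depth: observed counts are proportional to
# relative abundance. Sampling noise and sparsity are ignored in this sketch.
depth = 10_000
observed = library_size_scale(absolute) * depth

# Library size scaling recovers proportions only, so features 1-3 appear to
# decrease by the same factor even though their absolute abundance is unchanged.
scaled = library_size_scale(observed)
fold_change = scaled[:, 1] / scaled[:, 0]

# Assumption: feature 1 is known to be invariant (e.g. a spike-in). Dividing by
# its apparent fold change removes the shared compositional factor.
corrected = fold_change / fold_change[1]

print("apparent fold changes: ", np.round(fold_change, 3))
print("corrected fold changes:", np.round(corrected, 3))
```

In this toy case every unchanged feature carries the same spurious factor (the ratio of the two samples' total absolute abundance), which is why a single per-sample scale factor is enough to undo it. With sparse real data no such invariant feature is known in advance, and estimating that factor is the problem the paper addresses.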

Original language: English (US)
Number of pages: 1
Journal: BMC Genomics
Volume: 19
Issue number: 1
DOI: https://doi.org/10.1186/s12864-018-5160-5
State: Published - Nov 6 2018
Externally published: Yes

Keywords

  • Absolute abundance
  • Compositional bias
  • Count data
  • Data integration
  • Empirical Bayes
  • Metagenomics
  • Normalization
  • scRNAseq
  • Spike-in

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Kumar, M. S., Slud, E. V., Okrah, K., Hicks, S., Hannenhalli, S., & Corrada Bravo, H. (2018). Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics, 19(1). https://doi.org/10.1186/s12864-018-5160-5

@article{8198703b3fb74a23b9829854dd84f030,
title = "Analysis and correction of compositional bias in sparse sequencing count data",
keywords = "Absolute abundance, Compositional bias, Count data, Data integration, Empirical Bayes, Metagenomics, Normalization, scRNAseq, Spike-in",
author = "Kumar, {M. Senthil} and Slud, {Eric V.} and Kwame Okrah and Stephanie Hicks and Sridhar Hannenhalli and {Corrada Bravo}, H{\'e}ctor",
year = "2018",
month = "11",
day = "6",
doi = "10.1186/s12864-018-5160-5",
language = "English (US)",
volume = "19",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1"
}

TY  - JOUR
T1  - Analysis and correction of compositional bias in sparse sequencing count data
AU  - Kumar, M. Senthil
AU  - Slud, Eric V.
AU  - Okrah, Kwame
AU  - Hicks, Stephanie
AU  - Hannenhalli, Sridhar
AU  - Corrada Bravo, Héctor
PY  - 2018/11/6
Y1  - 2018/11/6
KW  - Absolute abundance
KW  - Compositional bias
KW  - Count data
KW  - Data integration
KW  - Empirical Bayes
KW  - Metagenomics
KW  - Normalization
KW  - scRNAseq
KW  - Spike-in
UR  - http://www.scopus.com/inward/record.url?scp=85056286133&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85056286133&partnerID=8YFLogxK
U2  - 10.1186/s12864-018-5160-5
DO  - 10.1186/s12864-018-5160-5
M3  - Article
C2  - 30400812
AN  - SCOPUS:85056286133
VL  - 19
JO  - BMC Genomics
JF  - BMC Genomics
SN  - 1471-2164
IS  - 1
ER  -