Smooth quantile normalization

Stephanie Hicks, Kwame Okrah, Joseph N. Paulson, John Quackenbush, Rafael A. Irizarry, Héctor Corrada Bravo

Research output: Contribution to journal - Article

Abstract

Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
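
The two assumptions contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' qsmooth method or its software; it only shows classic quantile normalization applied across all samples (one shared distribution) and applied separately within each group (distributions shared within a group but free to differ between groups). The function names, the toy matrix `counts`, and the group labels are hypothetical and chosen for illustration only.

```python
import numpy as np


def quantile_normalize(x):
    """Classic quantile normalization of a features-by-samples matrix.

    Every column (sample) is mapped onto one reference distribution:
    the row-wise mean of the sorted columns. Ties are broken by order
    of appearance, which is a simplification.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")  # row indices that sort each column
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of the k-th smallest values
    out = np.empty_like(x)
    # Place the k-th reference value where each column's k-th smallest value was.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out


def groupwise_quantile_normalize(x, groups):
    """Quantile-normalize the samples of each group separately.

    Assumption for this sketch: `groups` holds one label per column of `x`.
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(x)
    for g in np.unique(groups):
        cols = groups == g
        out[:, cols] = quantile_normalize(x[:, cols])
    return out


# Toy data: 4 features, 2 samples in group "a" and 2 in group "b",
# with group "b" on a globally higher scale.
counts = np.array([
    [5.0, 4.0, 30.0, 40.0],
    [2.0, 1.0, 10.0, 20.0],
    [3.0, 6.0, 20.0, 10.0],
    [4.0, 2.0, 40.0, 30.0],
])
groups = ["a", "a", "b", "b"]

global_norm = quantile_normalize(counts)                   # all columns share one distribution
group_norm = groupwise_quantile_normalize(counts, groups)  # shared only within each group
```

In `global_norm` the scale difference between the two groups is removed entirely, which is only appropriate if that difference is technical. In `group_norm` the difference is preserved in full. The abstract's within-group assumption corresponds to the premise of the second function, but neither function reproduces qsmooth itself.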

Original language: English (US)
Pages (from-to): 185-198
Number of pages: 14
Journal: Biostatistics
Volume: 19
Issue number: 2
DOI: https://doi.org/10.1093/biostatistics/kxx028
State: Published - Apr 1 2018
Externally published: Yes

Keywords

  • Global normalization methods
  • Quantile normalization

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

Hicks, S., Okrah, K., Paulson, J. N., Quackenbush, J., Irizarry, R. A., & Bravo, H. C. (2018). Smooth quantile normalization. Biostatistics, 19(2), 185-198. https://doi.org/10.1093/biostatistics/kxx028

@article{895a4b0561804a29b8d40616ed8678ad,
title = "Smooth quantile normalization",
abstract = "Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.",
keywords = "Global normalization methods, Quantile normalization",
author = "Stephanie Hicks and Kwame Okrah and Paulson, {Joseph N.} and John Quackenbush and Irizarry, {Rafael A.} and Bravo, {H{\'e}ctor Corrada}",
year = "2018",
month = "4",
day = "1",
doi = "10.1093/biostatistics/kxx028",
language = "English (US)",
volume = "19",
pages = "185--198",
journal = "Biostatistics",
issn = "1465-4644",
publisher = "Oxford University Press",
number = "2",

}

TY  - JOUR
T1  - Smooth quantile normalization
AU  - Hicks, Stephanie
AU  - Okrah, Kwame
AU  - Paulson, Joseph N.
AU  - Quackenbush, John
AU  - Irizarry, Rafael A.
AU  - Bravo, Héctor Corrada
PY  - 2018/4/1
Y1  - 2018/4/1
N2  - Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
KW  - Global normalization methods
KW  - Quantile normalization
UR  - http://www.scopus.com/inward/record.url?scp=85029389470&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85029389470&partnerID=8YFLogxK
U2  - 10.1093/biostatistics/kxx028
DO  - 10.1093/biostatistics/kxx028
M3  - Article
C2  - 29036413
AN  - SCOPUS:85029389470
VL  - 19
SP  - 185
EP  - 198
JO  - Biostatistics
JF  - Biostatistics
SN  - 1465-4644
IS  - 2
ER  -