Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction

Hilary S. Parker; Jeffrey T. Leek; Alexander V. Favorov; Michael Considine; Xiaoxin Xia; Sameer Chavan; Christine H. Chung; Elana J. Fertig

doi:10.1093/bioinformatics/btu375

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

Motivation: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms to remove these artifacts enhance differences between known biological covariates, but also carry potential concern of removing intragroup biological heterogeneity and thus any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging using standard algorithms designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori.

Results: Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), using a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even if the sample batches were highly confounded with HPV status in the training set.

Original language	English (US)
Pages (from-to)	2757-2763
Number of pages	7
Journal	Bioinformatics
Volume	30
Issue number	19
DOIs	https://doi.org/10.1093/bioinformatics/btu375
State	Published - Apr 2 2014

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{b4bb48240f294b62989e24be1b2a7a33,

title = "Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction",

abstract = "Motivation: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms to remove these artifacts enhance differences between known biological covariates, but also carry potential concern of removing intragroup biological heterogeneity and thus any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging using standard algorithms designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori. Results: Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), using a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even if the sample batches were highly confounded with HPV status in the training set.",

author = "Parker, {Hilary S.} and Leek, {Jeffrey T.} and Favorov, {Alexander V.} and Considine, Michael and Xia, Xiaoxin and Chavan, Sameer and Chung, {Christine H.} and Fertig, {Elana J.}",

note = "Publisher Copyright: {\textcopyright} 2014 The Author.",

year = "2014",

month = apr,

day = "2",

doi = "10.1093/bioinformatics/btu375",

language = "English (US)",

volume = "30",

pages = "2757--2763",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "19",

}

TY - JOUR

T1 - Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction

AU - Parker, Hilary S.

AU - Leek, Jeffrey T.

AU - Favorov, Alexander V.

AU - Considine, Michael

AU - Xia, Xiaoxin

AU - Chavan, Sameer

AU - Chung, Christine H.

AU - Fertig, Elana J.

PY - 2014/4/2

Y1 - 2014/4/2

AB - Motivation: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms to remove these artifacts enhance differences between known biological covariates, but also carry potential concern of removing intragroup biological heterogeneity and thus any personalized genomic signatures. As a result, accurate identification of novel subtypes from batch-corrected genomics data is challenging using standard algorithms designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori. Results: Therefore, we assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), using a new statistical model that is blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict Human Papillomavirus (HPV) status, pSVA improved cross-study validation even if the sample batches were highly confounded with HPV status in the training set.

UR - http://www.scopus.com/inward/record.url?scp=84911406694&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84911406694&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btu375

DO - 10.1093/bioinformatics/btu375

M3 - Article

C2 - 24907368

AN - SCOPUS:84911406694

SN - 1367-4803

VL - 30

SP - 2757

EP - 2763

JO - Bioinformatics

JF - Bioinformatics

IS - 19

ER -
