Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction

Hilary S. Parker, Jeffrey T. Leek, Alexander V. Favorov, Michael Considine, Xiaoxin Xia, Sameer Chavan, Christine H. Chung, Elana J. Fertig

Research output: Contribution to journal › Article › peer-review


Motivation: Sample source, procurement process and other technical variations introduce batch effects into genomics data. Algorithms that remove these artifacts enhance differences between known biological covariates, but they also risk removing intragroup biological heterogeneity and, with it, any personalized genomic signatures. As a result, accurately identifying novel subtypes from batch-corrected genomics data is challenging with standard algorithms, which are designed to remove batch effects for class comparison analyses. Nor can batch effects be corrected reliably in future applications of genomics-based clinical tests, in which the biological groups are by definition unknown a priori.

Results: We therefore assess the extent to which various batch correction algorithms remove true biological heterogeneity. We also introduce an algorithm, permuted-SVA (pSVA), that uses a new statistical model blind to biological covariates to correct for technical artifacts while retaining biological heterogeneity in genomic data. This algorithm facilitated accurate subtype identification in head and neck cancer from gene expression data in both formalin-fixed and frozen samples. When applied to predict human papillomavirus (HPV) status, pSVA improved cross-study validation even when the sample batches were highly confounded with HPV status in the training set.

Original language: English (US)
Pages (from-to): 2757-2763
Number of pages: 7
Issue number: 19
State: Published - Apr 2 2014

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics
