Test set bias affects reproducibility of gene signatures

Prasad Patil, Pierre Olivier Bachant-Winner, Benjamin Haibe-Kains, Jeffrey T. Leek

Research output: Contribution to journal (article, peer-reviewed)


Motivation: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction.

Results: Using a set of curated, publicly available breast cancer microarray experiments, we demonstrate that results from existing gene signatures that rely on normalizing test data may be irreproducible when the patient population changes composition or size. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms.
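The mechanism described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the expression values are invented, gene-wise z-scoring stands in for cross-sample normalization in general, and a single within-sample gene-pair comparison stands in for a rank-based signature. It is not the authors' code or any published signature.

```python
from statistics import mean, pstdev

def zscore_patient(patient, cohort):
    """Gene-wise z-score of `patient`, normalized together with `cohort`.

    Hypothetical stand-in for any cross-sample normalization step.
    """
    batch = [patient] + cohort
    normalized = []
    for g in range(len(patient)):
        column = [sample[g] for sample in batch]
        mu, sd = mean(column), pstdev(column)
        normalized.append((patient[g] - mu) / sd)
    return normalized

def score_based_call(normalized, gene=0, cutoff=0.0):
    """Toy predictor: thresholds one normalized gene value (illustrative only)."""
    return "high" if normalized[gene] > cutoff else "low"

def rank_based_call(patient, gene_a=0, gene_b=1):
    """Toy rank-based predictor: compares two genes within the same sample."""
    return "high" if patient[gene_a] > patient[gene_b] else "low"

# Invented expression values for one patient (three genes).
patient = [7.0, 5.0, 6.0]

# Two invented test cohorts that differ in how strongly gene 0 is expressed.
cohort_low = [[4.0, 5.5, 6.1], [5.0, 4.8, 5.9], [4.5, 5.2, 6.3]]
cohort_high = [[9.0, 5.1, 6.0], [8.5, 4.9, 6.2], [9.5, 5.3, 5.8]]

# The same patient, normalized alongside different companions.
call_low = score_based_call(zscore_patient(patient, cohort_low))
call_high = score_based_call(zscore_patient(patient, cohort_high))

# The rank-based call never looks at the other samples.
rank_call = rank_based_call(patient)

print(call_low, call_high, rank_call)
```

In this sketch the thresholded call for the same patient differs between the two cohorts, while the gene-pair call uses only the patient's own values and so cannot change with cohort composition or size.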

Original language: English (US)
Pages (from-to): 2318-2323
Number of pages: 6
Issue number: 14
State: Published - Jul 15 2015

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

