The doppelganger effect: Hidden duplicates in databases of transcriptome profiles

Levi Waldron, Markus Riester, Marcel Ramos, Giovanni Parmigiani, Michael Birrer

Research output: Contribution to journal › Article › peer-review


Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called "doppelgänger" effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelgänger-checking should be a part of standard procedure for combining multiple genomic datasets.
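The abstract does not spell out the matching algorithm, so the following is only an illustrative sketch of one way such screening could work, not the doppelgangR implementation. It assumes that duplicates show up as sample pairs whose expression profiles are far more correlated than typical pairs across two datasets. The function name `find_candidate_duplicates`, the `z_threshold` parameter, and the choice of Pearson correlation with a Fisher transform and a median/MAD outlier score are all assumptions made for this example.

```python
import numpy as np


def find_candidate_duplicates(expr_a, expr_b, z_threshold=6.0):
    """Flag sample pairs with unusually high between-dataset correlation.

    expr_a, expr_b: genes x samples arrays with the same genes in the same
    row order (an assumption; real datasets need gene matching first).
    Returns a list of (index_in_a, index_in_b, pearson_r) tuples.
    """

    def standardize(x):
        # Center each sample (column) and scale it to unit length, so that
        # a dot product between two columns equals their Pearson correlation.
        x = x - x.mean(axis=0, keepdims=True)
        return x / np.linalg.norm(x, axis=0, keepdims=True)

    a = standardize(np.asarray(expr_a, dtype=float))
    b = standardize(np.asarray(expr_b, dtype=float))
    corr = a.T @ b  # shape: samples_a x samples_b

    # Fisher z-transform makes the correlations closer to normally distributed.
    z = np.arctanh(np.clip(corr, -0.999999, 0.999999))

    # Robust center and spread, so a few true duplicates do not mask themselves.
    center = np.median(z)
    spread = 1.4826 * np.median(np.abs(z - center))
    if spread == 0:
        return []

    scores = (z - center) / spread
    hits = np.argwhere(scores > z_threshold)
    return [(int(i), int(j), float(corr[i, j])) for i, j in hits]
```

Calling the function on two expression matrices returns candidate pairs for manual review. Flagged pairs are best treated as probable duplicates to be checked against clinical annotation, not as confirmed matches.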

Original language: English (US)
Article number: djw146
Journal: Journal of the National Cancer Institute
Issue number: 11
State: Published - 2016

ASJC Scopus subject areas

  • Medicine(all)
  • Oncology
  • Cancer Research
