The doppelganger effect: Hidden duplicates in databases of transcriptome profiles

Levi Waldron; Markus Riester; Marcel Ramos; Giovanni Parmigiani; Michael Birrer

doi:10.1093/jnci/djw146


School of Medicine

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called "doppelgänger" effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelganger-checking should be a part of standard procedure for combining multiple genomic datasets.
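As a rough illustration of what screening two datasets for duplicate profiles can look like, the sketch below flags cross-dataset sample pairs whose expression vectors are almost perfectly correlated. It is not the doppelgangR implementation and not the authors' method: the abstract does not give algorithmic details, so the use of Pearson correlation, the fixed `threshold` of 0.99, the function names, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def flag_candidate_duplicates(dataset_a, dataset_b, threshold=0.99):
    """Return (id_a, id_b, r) for cross-dataset pairs with r >= threshold.

    Each dataset maps a sample ID to expression values over the same
    ordered set of genes. The threshold is an arbitrary illustrative cutoff.
    """
    hits = []
    for id_a, profile_a in dataset_a.items():
        for id_b, profile_b in dataset_b.items():
            r = pearson(profile_a, profile_b)
            if r >= threshold:
                hits.append((id_a, id_b, r))
    # Strongest matches first
    return sorted(hits, key=lambda hit: -hit[2])


# Toy data (hypothetical): "t1" is a slightly noisy copy of "s1".
study_1 = {"s1": [2.1, 5.0, 7.3, 1.2, 9.8], "s2": [8.0, 1.5, 3.3, 6.1, 2.2]}
study_2 = {"t1": [2.0, 5.1, 7.2, 1.3, 9.9], "t2": [4.0, 4.2, 9.5, 0.3, 5.5]}

candidates = flag_candidate_duplicates(study_1, study_2)
for id_a, id_b, r in candidates:
    print(f"{id_a} ~ {id_b}: r = {r:.4f}")
```

A fixed cutoff is the simplest possible choice here. Real datasets profiled on different platforms would likely need normalization to a shared gene set and a data-driven way of deciding which correlations are unusually high, which this sketch does not attempt.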

Original language	English (US)
Article number	djw146
Journal	Journal of the National Cancer Institute
Volume	108
Issue number	11
DOIs	https://doi.org/10.1093/jnci/djw146
State	Published - 2016

ASJC Scopus subject areas

General Medicine
Oncology
Cancer Research


Cite this

@article{a615a5d14ad040beb2415a933e873001,

title = "The doppelganger effect: Hidden duplicates in databases of transcriptome profiles",

abstract = "Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called {"}doppelg{\"a}nger{"} effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelganger-checking should be a part of standard procedure for combining multiple genomic datasets.",

author = "Levi Waldron and Markus Riester and Marcel Ramos and Giovanni Parmigiani and Michael Birrer",

year = "2016",

doi = "10.1093/jnci/djw146",

language = "English (US)",

volume = "108",

journal = "Journal of the National Cancer Institute",

issn = "0027-8874",

publisher = "Oxford University Press",

number = "11",

}

TY - JOUR

T1 - The doppelganger effect

T2 - Hidden duplicates in databases of transcriptome profiles

AU - Waldron, Levi

AU - Riester, Markus

AU - Ramos, Marcel

AU - Parmigiani, Giovanni

AU - Birrer, Michael

PY - 2016

Y1 - 2016

N2 - Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called "doppelgänger" effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelganger-checking should be a part of standard procedure for combining multiple genomic datasets.

AB - Whole-genome analysis of cancer specimens is commonplace, and investigators frequently share or re-use specimens in later studies. Duplicate expression profiles in public databases will impact re-analysis if left undetected, a so-called "doppelgänger" effect. We propose a method that should be routine practice to accurately match duplicate cancer transcriptomes when nucleotide-level sequence data are unavailable, even for samples profiled by different microarray technologies or by both microarray and RNA sequencing. We demonstrate the effectiveness of the method in databases containing dozens of datasets and thousands of ovarian, breast, bladder, and colorectal cancer microarray profiles and of matching microarray and RNA sequencing expression profiles from The Cancer Genome Atlas (TCGA). We identified probable duplicates among more than 50% of studies, originating in different continents, using different technologies, published years apart, and even within the TCGA itself. Finally, we provide the doppelgangR Bioconductor package for screening transcriptome databases for duplicates. Given the potential for unrecognized duplication to falsely inflate prediction accuracy and confidence in differential expression, doppelganger-checking should be a part of standard procedure for combining multiple genomic datasets.

UR - http://www.scopus.com/inward/record.url?scp=85014813622&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85014813622&partnerID=8YFLogxK

U2 - 10.1093/jnci/djw146

DO - 10.1093/jnci/djw146

M3 - Article

C2 - 27381624

AN - SCOPUS:85014813622

SN - 0027-8874

VL - 108

JO - Journal of the National Cancer Institute

JF - Journal of the National Cancer Institute

IS - 11

M1 - djw146

ER -
