Consistent RNA sequencing contamination in GTEx and other data sets

Tim O. Nieuwenhuis, Stephanie Y. Yang, Rohan X. Verma, Vamsee Pillalamarri, Dan E. Arking, Avi Z. Rosenberg, Matthew N. McCall, Marc K. Halushka

Research output: Contribution to journal › Article › peer-review


A challenge of next-generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata, along with RNA-seq datasets from other studies, to understand factors that contribute to contamination. Here we report that, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting bulk and single-cell (scRNA-seq) dataset analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses.

Original language: English (US)
Article number: 1933
Journal: Nature Communications
Issue number: 1
State: Published - Dec 1 2020

ASJC Scopus subject areas

  • Chemistry(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Physics and Astronomy(all)
