Big data reproducibility: Applications in brain imaging

Consortium for Reliability and Reproducibility

Research output: Contribution to journal, Article, peer-review


Reproducibility, the ability to replicate analytical findings, is a prerequisite for both scientific discovery and clinical utility. Troublingly, we are in the midst of a reproducibility crisis, in which many investigations fail to replicate. Although many believe that these failings are due to misunderstanding or misapplication of statistical inference (e.g., p-values or the dichotomization of “statistically significant”), we believe the shortcomings arise much earlier in the data science workflow, at the level of measurement, including data acquisition and reconstruction. A key to reproducibility is that multiple measurements of the same item (e.g., experimental sample or clinical participant) are similar to one another and dissimilar from measurements of other items. The intra-class correlation coefficient (ICC) quantifies reproducibility in this way, but only for univariate (one-dimensional) Gaussian data. In contrast, big data is multivariate (high-dimensional), non-Gaussian, and often non-Euclidean (including text, images, speech, and networks), rendering ICC inadequate. We propose a novel statistic, discriminability, which quantifies the degree to which individual samples are discriminable from one another, without restricting the data to be univariate, Gaussian, or even Euclidean. We then introduce the possibility of optimizing experimental design by increasing discriminability. We prove that optimizing discriminability yields an improved ability to use the data for subsequent inference tasks, without specifying the inference task a priori. We then apply this approach to a brain imaging dataset built by the “Consortium for Reliability and Reproducibility”, which consists of 28 disparate magnetic resonance imaging datasets. Optimizing discriminability improves performance on multiple subsequent inference tasks, even though those tasks were not considered in the optimization. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the reproducibility crisis.
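
The abstract does not give a formula for discriminability, so the following is only a minimal sketch of one plausible sample-based reading of the idea: the fraction of comparisons in which a repeated measurement of the same item lies closer than a measurement of a different item. The function name `discriminability`, the `ids` argument, and the use of Euclidean distance are assumptions made for illustration, not the authors' implementation; for non-Euclidean data a different distance would have to be substituted.

```python
import numpy as np

def discriminability(X, ids):
    """Illustrative estimate: share of (within-item, between-item) distance
    comparisons in which the within-item distance is the smaller one.
    Ties count as half. Euclidean distance is an assumption of this sketch."""
    X = np.asarray(X, dtype=float)
    ids = np.asarray(ids)

    # Pairwise Euclidean distances between all measurements.
    diffs = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))

    hits, total = 0.0, 0
    for i in range(len(ids)):
        same = ids == ids[i]
        other = ~same          # measurements of different items
        same[i] = False        # exclude the measurement itself
        for j in np.flatnonzero(same):
            hits += np.sum(D[i, other] > D[i, j])
            hits += 0.5 * np.sum(D[i, other] == D[i, j])
            total += int(other.sum())
    if total == 0:
        raise ValueError("need repeated measurements and at least two items")
    return hits / total

# Two items, two measurements each; repeats of an item sit close together.
X = [[0.0], [1.0], [10.0], [11.0]]
well_separated = discriminability(X, [0, 0, 1, 1])
mislabeled = discriminability(X, [0, 1, 0, 1])
```

Under this reading, a value near 1 means repeated measurements of an item are almost always closer to each other than to measurements of other items, while lower values indicate that items are hard to tell apart.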

Original language: English (US)
Journal: Unknown Journal
State: Published - Oct 13 2019

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)
  • Immunology and Microbiology(all)
  • Neuroscience(all)
  • Pharmacology, Toxicology and Pharmaceutics(all)
