TY - JOUR
T1 - Reproducible big data science
T2 - A case study in continuous FAIRness
AU - Madduri, Ravi
AU - Chard, Kyle
AU - D'Arcy, Mike
AU - Jung, Segun C.
AU - Rodriguez, Alexis
AU - Sulakhe, Dinanath
AU - Deutsch, Eric
AU - Funk, Cory
AU - Heavner, Ben
AU - Richards, Matthew
AU - Shannon, Paul
AU - Glusman, Gustavo
AU - Price, Nathan
AU - Kesselman, Carl
AU - Foster, Ian
N1 - Funding Information:
This work was supported in part by NIH contracts 1U54EB020406-01: Big Data for Discovery Science Center, 1OT3OD025458-01: A Commons Platform for Promoting Continuous FAIRness, and 5R01HG009018: Hardening Globus Genomics; and DOE contract DE-AC02-06CH11357. The cloud computing resources were provided by the NIH Commons Cloud Credits program and the Amazon Web Services Research Cloud Credits program.
Publisher Copyright:
© 2019 Madduri et al.
PY - 2019/4
Y1 - 2019/4
AB - Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
UR - http://www.scopus.com/inward/record.url?scp=85064411644&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064411644&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0213013
DO - 10.1371/journal.pone.0213013
M3 - Article
C2 - 30973881
AN - SCOPUS:85064411644
SN - 1932-6203
VL - 14
JO - PLOS ONE
JF - PLOS ONE
IS - 4
M1 - e0213013
ER -