TY - JOUR
T1 - Reproducible big data science
T2 - A case study in continuous FAIRness
AU - Madduri, Ravi
AU - Chard, Kyle
AU - D'Arcy, Mike
AU - Jung, Segun C.
AU - Rodriguez, Alexis
AU - Sulakhe, Dinanath
AU - Deutsch, Eric
AU - Funk, Cory
AU - Heavner, Ben
AU - Richards, Matthew
AU - Shannon, Paul
AU - Glusman, Gustavo
AU - Price, Nathan
AU - Kesselman, Carl
AU - Foster, Ian
N1 - Funding Information:
This work was supported in part by NIH contracts 1U54EB020406-01: Big Data for Discovery Science Center, 1OT3OD025458-01: A Commons Platform for Promoting Continuous FAIRness, and 5R01HG009018: Hardening Globus Genomics; and DOE contract DE-AC02-06CH11357. The cloud computing resources were provided by the NIH Commons Cloud Credits program and the Amazon Web Services Research Cloud Credits program.
Publisher Copyright:
© 2019 Madduri et al.
PY - 2019/4
Y1 - 2019/4
AB - Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
UR - http://www.scopus.com/inward/record.url?scp=85064411644&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064411644&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0213013
DO - 10.1371/journal.pone.0213013
M3 - Article
C2 - 30973881
AN - SCOPUS:85064411644
SN - 1932-6203
VL - 14
JO - PLOS ONE
JF - PLOS ONE
IS - 4
M1 - e0213013
ER -