Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques

Lisa M. Gandy, Jordan Gumm, Benjamin Fertig, Anne Thessen, Michael J. Kennish, Sameer Chavan, Luigi Marchionni, Xiaoxin Xia, Shambhavi Shankrit, Elana Fertig

Research output: Contribution to journal › Article

Abstract

Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combining records in spreadsheets from differing studies requires tedious and error-prone human curation. These efforts result in a significant time and cost barrier to synthesis research. We propose an information-retrieval-inspired algorithm, Synthesize, that merges unstructured data automatically based on both column labels and values. Applied to cancer and ecological datasets, the Synthesize algorithm achieved high accuracy (on the order of 85-100%). We further implement Synthesize in an open-source web application, Synthesizer (https://github.com/lisagandy/synthesizer). The software accepts input as spreadsheets in comma-separated values (CSV) format, visualizes the merged data, and outputs the results as a new spreadsheet. Synthesizer includes an easy-to-use graphical user interface, which enables the user to finish combining data and obtain perfect accuracy. Future work will allow detection of units to automatically merge continuous data and application of the algorithm to other data formats, including databases.
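
The idea of merging spreadsheets on both column labels and column values can be illustrated with a minimal Python sketch. This is not the published Synthesize algorithm: the Jaccard similarity measure, the equal weighting of labels and values, the 0.3 threshold, and all function names (read_csv, match_columns, merge) are assumptions chosen for illustration only.

```python
import csv
import io


def read_csv(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def tokens(label):
    """Lowercase a column label and split it into a set of word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in label.lower())
    return set(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def column_values(rows, name):
    """Distinct, normalized, non-empty values of one column."""
    return {r[name].strip().lower() for r in rows if r[name].strip()}


def match_columns(rows_a, rows_b, label_weight=0.5, threshold=0.3):
    """Greedily pair columns of table A with columns of table B.

    The score mixes label-token overlap with value overlap. The weight
    and threshold are illustrative guesses, not published settings.
    """
    scored = []
    for a in rows_a[0]:
        for b in rows_b[0]:
            label_sim = jaccard(tokens(a), tokens(b))
            value_sim = jaccard(column_values(rows_a, a),
                                column_values(rows_b, b))
            score = label_weight * label_sim + (1 - label_weight) * value_sim
            scored.append((score, a, b))
    scored.sort(key=lambda t: t[0], reverse=True)
    mapping, used_b = {}, set()
    for score, a, b in scored:
        if score >= threshold and a not in mapping and b not in used_b:
            mapping[a] = b
            used_b.add(b)
    return mapping


def merge(rows_a, rows_b):
    """Stack both tables under A's labels, keeping unmatched B columns."""
    cols_a = list(rows_a[0])
    reverse = {b: a for a, b in match_columns(rows_a, rows_b).items()}

    def target(col_b):
        # Matched columns take A's label; unmatched ones keep their own,
        # with a suffix if the name would collide with a column of A.
        if col_b in reverse:
            return reverse[col_b]
        return col_b + "_b" if col_b in cols_a else col_b

    header = cols_a + [target(b) for b in rows_b[0] if b not in reverse]
    merged = [{h: r.get(h, "") for h in header} for r in rows_a]
    for r in rows_b:
        renamed = {target(k): v for k, v in r.items()}
        merged.append({h: renamed.get(h, "") for h in header})
    return header, merged
```

Called on two small tables whose labels differ ("Tumor Stage" versus "stage") but whose values overlap, match_columns pairs the columns and merge returns a single header with the rows of both tables stacked beneath it. Columns that fall below the threshold are left unmerged, which is the point at which a human curator would step in.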

Original language: English (US)
Article number: e0175860
Journal: PLoS One
Volume: 12
Issue number: 4
DOI: https://doi.org/10.1371/journal.pone.0175860
State: Published - Apr 1 2017

Fingerprint

information retrieval
Information Storage and Retrieval
Spreadsheets
synthesis
user interface
Software
Graphical user interfaces
methodology
Research Personnel
Databases
Labels
Costs and Cost Analysis
neoplasms
Research
Datasets
Costs

ASJC Scopus subject areas

  • Medicine(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques. / Gandy, Lisa M.; Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J.; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana.

In: PLoS One, Vol. 12, No. 4, e0175860, 01.04.2017.

Gandy, LM, Gumm, J, Fertig, B, Thessen, A, Kennish, MJ, Chavan, S, Marchionni, L, Xia, X, Shankrit, S & Fertig, E 2017, 'Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques', PLoS One, vol. 12, no. 4, e0175860. https://doi.org/10.1371/journal.pone.0175860
Gandy, Lisa M.; Gumm, Jordan; Fertig, Benjamin; Thessen, Anne; Kennish, Michael J.; Chavan, Sameer; Marchionni, Luigi; Xia, Xiaoxin; Shankrit, Shambhavi; Fertig, Elana. / Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques. In: PLoS One. 2017; Vol. 12, No. 4.
@article{5f3cb94434734ba8ae8716566c569bdb,
title = "Synthesizer: Expediting synthesis studies from context-free data with information retrieval techniques",
author = "Gandy, {Lisa M.} and Jordan Gumm and Benjamin Fertig and Anne Thessen and Kennish, {Michael J.} and Sameer Chavan and Luigi Marchionni and Xiaoxin Xia and Shambhavi Shankrit and Elana Fertig",
year = "2017",
month = "4",
day = "1",
doi = "10.1371/journal.pone.0175860",
language = "English (US)",
volume = "12",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "4",

}

TY - JOUR

T1 - Synthesizer

T2 - Expediting synthesis studies from context-free data with information retrieval techniques

AU - Gandy, Lisa M.

AU - Gumm, Jordan

AU - Fertig, Benjamin

AU - Thessen, Anne

AU - Kennish, Michael J.

AU - Chavan, Sameer

AU - Marchionni, Luigi

AU - Xia, Xiaoxin

AU - Shankrit, Shambhavi

AU - Fertig, Elana

PY - 2017/4/1

Y1 - 2017/4/1

UR - http://www.scopus.com/inward/record.url?scp=85018585958&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018585958&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0175860

DO - 10.1371/journal.pone.0175860

M3 - Article

C2 - 28437440

AN - SCOPUS:85018585958

VL - 12

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 4

M1 - e0175860

ER -