Using shape expressions (ShEx) to share RDF data models and to guide curation with rigorous validation

Katherine Thornton; Harold Solbrig; Gregory S. Stupp; Jose Emilio Labra Gayo; Daniel Mietchen; Eric Prud’hommeaux; Andra Waagmeester

doi:10.1007/978-3-030-21348-0_39

School of Medicine

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

We discuss Shape Expressions (ShEx), a concise, formal, modeling and validation language for RDF structures. For instance, a Shape Expression could prescribe that subjects in a given RDF graph that fall into the shape “Paper” are expected to have a section called “Abstract”, and any ShEx implementation can confirm whether that is indeed the case for all such subjects within a given graph or subgraph. There are currently five actively maintained ShEx implementations. We discuss how we use the JavaScript, Scala and Python implementations in RDF data validation workflows in distinct, applied contexts. We present examples of how ShEx can be used to model and validate data from two different sources, the domain-specific Fast Healthcare Interoperability Resources (FHIR) and the domain-generic Wikidata knowledge base, which is the linked database built and maintained by the Wikimedia Foundation as a sister project to Wikipedia. Example projects that are using Wikidata as a data curation platform are presented as well, along with ways in which they are using ShEx for modeling and validation. When reusing RDF graphs created by others, it is important to know how the data is represented. Current practices of using human-readable descriptions or ontologies to communicate data structures often lack sufficient precision for data consumers to quickly and easily understand data representation details. We provide concrete examples of how we use ShEx as a constraint and validation language that allows humans and machines to communicate unambiguously about data assets. We use ShEx to exchange and understand data models of different origins, and to express a shared model of a resource’s footprint in a Linked Data source. We also use ShEx to agilely develop data models, test them against sample data, and revise or refine them. The expressivity of ShEx allows us to catch disagreement, inconsistencies, or errors efficiently, both at the time of input, and through batch inspections. 
ShEx addresses the need of the Semantic Web community to ensure data quality for RDF graphs. It is currently being used in the development of FHIR/RDF. The language is sufficiently expressive to capture constraints in FHIR, and the intuitive syntax helps people to quickly grasp the range of conformant documents. The publication workflow for FHIR tests all of these examples against the ShEx schemas, catching non-conformant data before they reach the public. ShEx is also currently used in Wikidata projects such as Gene Wiki and WikiCite to develop quality-control pipelines to maintain data integrity and incorporate or harmonize differences in data across different parts of the pipelines.
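
The “Paper must have an Abstract” example from the abstract can be made concrete with a small sketch. The block below is not ShEx and does not use any of the ShEx implementations mentioned above; it is a plain-Python illustration of the idea, in which a shape is a required property checked against every focus node. The prefixed names (`ex:Paper`, `ex:abstract`), the sample triples, and the choice to select focus nodes by `rdf:type` are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement, with terms kept as plain strings for simplicity."""
    subject: str
    predicate: str
    obj: str


# Hypothetical vocabulary, for illustration only (not taken from the paper).
RDF_TYPE = "rdf:type"
PAPER = "ex:Paper"
HAS_ABSTRACT = "ex:abstract"


def nodes_of_type(graph: List[Triple], type_iri: str) -> List[str]:
    """Return the subjects declared to be of the given type, sorted."""
    return sorted(
        {t.subject for t in graph if t.predicate == RDF_TYPE and t.obj == type_iri}
    )


def has_property(graph: List[Triple], node: str, predicate: str) -> bool:
    """True if the node has at least one triple with the given predicate."""
    return any(t.subject == node and t.predicate == predicate for t in graph)


def validate_papers(graph: List[Triple]) -> Dict[str, bool]:
    """Check every ex:Paper node for the presence of an ex:abstract triple."""
    return {
        node: has_property(graph, node, HAS_ABSTRACT)
        for node in nodes_of_type(graph, PAPER)
    }


# Sample data: paper1 conforms, paper2 lacks an abstract, dataset1 is not a Paper.
GRAPH = [
    Triple("ex:paper1", RDF_TYPE, PAPER),
    Triple("ex:paper1", HAS_ABSTRACT, "We discuss Shape Expressions (ShEx) ..."),
    Triple("ex:paper2", RDF_TYPE, PAPER),
    Triple("ex:dataset1", RDF_TYPE, "ex:Dataset"),
]

if __name__ == "__main__":
    for node, ok in validate_papers(GRAPH).items():
        print(node, "conforms" if ok else "does not conform")
```

A real ShEx schema expresses the same constraint declaratively and adds cardinalities, value types, and nested shapes; the sketch only checks for the presence of one property. Running the same check over a whole graph corresponds to the batch inspections described above, while calling it on a single new node corresponds to validation at the time of input.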

Original language	English (US)
Title of host publication	The Semantic Web - 16th International Conference, ESWC 2019, Proceedings
Editors	Pascal Hitzler, Miriam Fernández, Krzysztof Janowicz, Amrapali Zaveri, Alasdair J.G. Gray, Vanessa Lopez, Armin Haller, Karl Hammar
Publisher	Springer Verlag
Pages	606-620
Number of pages	15
ISBN (Print)	9783030213473
DOIs	https://doi.org/10.1007/978-3-030-21348-0_39
State	Published - 2019
Event	16th International Semantic Web Conference, ESWC 2019 - Portorož, Slovenia Duration: Jun 2 2019 → Jun 6 2019

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11503 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th International Semantic Web Conference, ESWC 2019
Country/Territory	Slovenia
City	Portorož
Period	6/2/19 → 6/6/19

Keywords

Digital preservation wd:Q632897
FHIR wd:Q19597236
RDF wd:Q54872
ShEx wd:Q29377880
Wikidata wd:Q2013

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Thornton, K., Solbrig, H., Stupp, G. S., Labra Gayo, J. E., Mietchen, D., Prud’hommeaux, E., & Waagmeester, A. (2019). Using shape expressions (ShEx) to share RDF data models and to guide curation with rigorous validation. In P. Hitzler, M. Fernández, K. Janowicz, A. Zaveri, A. J. G. Gray, V. Lopez, A. Haller, & K. Hammar (Eds.), The Semantic Web - 16th International Conference, ESWC 2019, Proceedings (pp. 606-620). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11503 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-030-21348-0_39

@inproceedings{2f891d57478f4791920aa24ad781eaf1,

title = "Using shape expressions (ShEx) to share RDF data models and to guide curation with rigorous validation",

keywords = "Digital preservation wd:Q632897, FHIR wd:Q19597236, RDF wd:Q54872, ShEx wd:Q29377880, Wikidata wd:Q2013",

author = "Katherine Thornton and Harold Solbrig and Stupp, {Gregory S.} and {Labra Gayo}, {Jose Emilio} and Daniel Mietchen and Eric Prud{\textquoteright}hommeaux and Andra Waagmeester",

note = "Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2019.; 16th International Semantic Web Conference, ESWC 2019 ; Conference date: 02-06-2019 Through 06-06-2019",

year = "2019",

doi = "10.1007/978-3-030-21348-0_39",

language = "English (US)",

isbn = "9783030213473",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "606--620",

editor = "Pascal Hitzler and Miriam Fern{\'a}ndez and Krzysztof Janowicz and Amrapali Zaveri and Gray, {Alasdair J.G.} and Vanessa Lopez and Armin Haller and Karl Hammar",

booktitle = "The Semantic Web - 16th International Conference, ESWC 2019, Proceedings",

}

TY - GEN

T1 - Using shape expressions (ShEx) to share RDF data models and to guide curation with rigorous validation

AU - Thornton, Katherine

AU - Solbrig, Harold

AU - Stupp, Gregory S.

AU - Labra Gayo, Jose Emilio

AU - Mietchen, Daniel

AU - Prud’hommeaux, Eric

AU - Waagmeester, Andra

N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.

PY - 2019

Y1 - 2019

KW - Digital preservation wd:Q632897

KW - FHIR wd:Q19597236

KW - RDF wd:Q54872

KW - ShEx wd:Q29377880

KW - Wikidata wd:Q2013

UR - http://www.scopus.com/inward/record.url?scp=85066789636&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066789636&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-21348-0_39

DO - 10.1007/978-3-030-21348-0_39

M3 - Conference contribution

AN - SCOPUS:85066789636

SN - 9783030213473

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 606

EP - 620

BT - The Semantic Web - 16th International Conference, ESWC 2019, Proceedings

A2 - Hitzler, Pascal

A2 - Fernández, Miriam

A2 - Janowicz, Krzysztof

A2 - Zaveri, Amrapali

A2 - Gray, Alasdair J.G.

A2 - Lopez, Vanessa

A2 - Haller, Armin

A2 - Hammar, Karl

PB - Springer Verlag

T2 - 16th International Semantic Web Conference, ESWC 2019

Y2 - 2 June 2019 through 6 June 2019

ER -
