Formalizing and Validating Wikidata's Property Constraints using SHACL+SPARQL

Tracking #: 3378-4592

This paper is currently under review
Nicolas Ferranti
Jairo Francisco De Souza
Axel Polleres

Responsible editor: Guest Editors Wikidata 2022

Submission type: Full Paper
In this paper, we examine the role of constraints in maintaining data integrity in knowledge graphs, with a specific focus on Wikidata, one of the largest collaboratively edited open knowledge graphs on the web. Although a W3C recommendation exists for validating RDF knowledge graphs against constraints, namely the Shapes Constraint Language (SHACL), Wikidata currently represents its property constraints through its own RDF data model, using proprietary authoritative namespaces and partially ambiguous natural language definitions. To close this gap, we investigate the semantics of Wikidata property constraints by formalizing them using SHACL and SPARQL. While the expressivity of SHACL Core turns out to be insufficient for expressing all Wikidata property constraint types, we present SPARQL queries that identify violations for all current Wikidata constraint types. We compare the semantics of this unambiguous SPARQL formalization with Wikidata's violation reporting system and discuss the limitations of evaluating the queries via Wikidata's SPARQL query endpoint, which stem from the endpoint's current scalability limits. On the one hand, our study sheds light on the unique characteristics of constraints in Wikidata, which may have implications for future efforts to improve the quality and accuracy of data in collaborative knowledge graphs. On the other hand, as a byproduct, our formalization extends existing benchmarks for both SHACL and SPARQL with a challenging, large-scale, real-world use case.
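To make the idea of a SPARQL query that identifies constraint violations concrete, here is a minimal Python sketch for one simple case: a property that is expected to have at most one value per item. It is not one of the queries from the paper. The function names, the use of truthy `wdt:` statements, and the property identifier in the usage example are assumptions for illustration only, and the sketch ignores qualifiers, ranks, and constraint exceptions that a faithful formalization would need to handle.

```python
from collections import defaultdict


def single_value_violation_query(property_id: str, limit: int = 100) -> str:
    """Build a SPARQL query listing items with more than one distinct value
    for the given property (a simplified 'at most one value' check).

    Hypothetical helper: it only builds the query string and does not
    contact any endpoint.
    """
    if not (property_id.startswith("P") and property_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata property id: {property_id!r}")
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item (COUNT(DISTINCT ?value) AS ?values) WHERE {\n"
        f"  ?item wdt:{property_id} ?value .\n"
        "}\n"
        "GROUP BY ?item\n"
        "HAVING (COUNT(DISTINCT ?value) > 1)\n"
        f"LIMIT {limit}\n"
    )


def find_single_value_violations(triples, property_id):
    """Apply the same check to an in-memory list of (subject, property, value)
    tuples, returning the sorted subjects that violate it."""
    values = defaultdict(set)
    for subject, prop, value in triples:
        if prop == property_id:
            values[subject].add(value)
    return sorted(s for s, seen in values.items() if len(seen) > 1)


# Toy data: Q1 has two distinct values for P569, Q2 has one.
# The identifiers are used purely as illustrative placeholders.
toy_triples = [
    ("Q1", "P569", "1900-01-01"),
    ("Q1", "P569", "1901-01-01"),
    ("Q2", "P569", "1950-05-05"),
    ("Q2", "P19", "Q64"),
]
violations = find_single_value_violations(toy_triples, "P569")
query = single_value_violation_query("P569", limit=10)
```

The in-memory function mirrors the grouping and counting that the query delegates to the SPARQL engine, which makes the intended semantics easy to check on small data before running anything against a large endpoint.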