Wikidata Subsetting: Approaches, Tools, and Evaluation

Tracking #: 3372-4586

This paper is currently under review
Seyed Amir Hosseini Beghaeiraveri
Jose Emilio Labra-Gayo
Andra Waagmeester
Ammar Ammar
Carolina Gonzalez
Denise Slenter
Sabah Ul-Hasan
Egon L. Willighagen
Fiona McNeill
Alasdair J G Gray

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
Wikidata is a massive Knowledge Graph (KG) containing more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The sheer volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers require only a limited range of data from Wikidata, focused on a particular topic, for their use case. Subsetting is the process of defining and extracting the required range of data from a KG; this process has received increasing attention in recent years. Several approaches and specific tools have been developed for subsetting, but they have not yet been evaluated. In this paper, we survey the available subsetting approaches, describing their general strengths and weaknesses, and evaluate four practical tools specific to Wikidata subsetting -- WDSub, KGTK, WDumper, and WDF -- in terms of execution performance, extraction accuracy, and flexibility in defining subsets. The results show that all four tools achieve comparable and appropriate accuracy of more than 95%. The fastest tool in extraction is WDF, while the most flexible is WDSub. During the experiments, we defined multiple subset use cases and analyzed the extracted subsets, obtaining valuable information about the variety and quality of Wikidata that would otherwise not be obtainable through the public Wikidata SPARQL endpoint.
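To make the notion of subsetting concrete, the sketch below (not taken from the paper, and independent of the four evaluated tools) filters entity documents in the style of a Wikidata JSON dump, keeping only entities whose "instance of" (P31) statements point to a chosen class. The two-line `sample_dump` is a hypothetical, heavily simplified stand-in for a real dump, where each line is one entity's JSON document.

```python
import json

# Hypothetical two-entity dump: each line mimics (in simplified form) one
# entity document from a Wikidata JSON dump.
sample_dump = [
    '{"id": "Q42", "claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}}',
    '{"id": "Q64", "claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q515"}}}}]}}',
]

def subset(lines, prop, target):
    """Yield entities with at least one `prop` statement whose value is `target`."""
    for line in lines:
        entity = json.loads(line)
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and value.get("id") == target:
                yield entity
                break

# Extract the subset of instances of human (Q5).
humans = [entity["id"] for entity in subset(sample_dump, "P31", "Q5")]
print(humans)  # ['Q42']
```

A dump-scan like this illustrates why dedicated tools matter: the same one-pass filter over the full ~100 GB dump is exactly the workload that WDumper, WDSub, KGTK, and WDF optimize, each with its own subset-definition language.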
Full PDF Version: 
Under Review