Wikidata subsetting: approaches, tools, and evaluation

Tracking #: 3491-4705

Seyed Amir Hosseini Beghaeiraveri
Jose Emilio Labra-Gayo
Andra Waagmeester
Ammar Ammar
Carolina Gonzalez
Denise Slenter
Sabah Ul-Hasan
Egon L. Willighagen
Fiona McNeill
Alasdair J G Gray

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract:
Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata, focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Several approaches and specific tools have been developed for subsetting, but they have not yet been evaluated. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific to Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. The results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.


Solicited Reviews:
Review #1
By Daniel Erenrich submitted on 02/Jul/2023
Review Comment:

Thank you for responding to my concerns. The paper is generally much stronger. I have a few minor points but generally I think the paper is ready for publication. Thank you for your efforts.

Minor points:
The information added to Section 4.4 is very helpful and raises my confidence in the results. I am still a little confused, though. The Unicode character U+200D is a valid code point, right? Is it being used in an invalid context? Line 11 of the file ‘item-Q29718370-found.json’ is ‘"value": "xmas-1"’, which does not include the character “_” (though I’m unsure exactly which character you are referencing, since that looks to me like a normal underscore, which I imagine wouldn’t cause any problems; maybe supply the Unicode code point?).
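To make the code-point question concrete, a small check like the following (a hypothetical helper, not from the paper or its repository) would let the authors report exactly which invisible or non-ASCII characters a label contains, e.g. a zero-width joiner (U+200D) that is indistinguishable from nothing when printed:

```python
def report_unusual_codepoints(text):
    """Return (index, char, 'U+XXXX') tuples for every non-ASCII character
    in `text`; zero-width characters such as U+200D fall in this range."""
    suspicious = []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F:  # non-ASCII, including invisible format characters
            suspicious.append((i, ch, f"U+{ord(ch):04X}"))
    return suspicious

# This string renders identically to "xmas-1" but hides a zero-width joiner:
label = "xmas\u200d-1"
print(report_unusual_codepoints(label))  # reports index 4, code point U+200D
```

Supplying output of this form in Section 4.4 would remove the ambiguity about which character is meant.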

In 5.1 you report that the Wikidata SPARQL endpoint timed out when counting certain types. I’m a little confused about why this happened. I ran the SPARQL query:
SELECT (COUNT(?item) AS ?c) WHERE { ?item wdt:P31 wd:Q7187 . }
It returned without a timeout (the value was 1,221,255, in 3 seconds). Looking at the SPARQL queries in your GitHub repo, I’m guessing it timed out because you were trying to filter out all the fields, but that isn’t needed for a simple count. Maybe I’m misunderstanding the purpose of providing the values (i.e., demonstrating that extracting the subset using SPARQL isn’t feasible). I know I didn’t mention this in the previous review round, so I don’t consider it a blocking concern, but you could populate these numbers if you want.
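The distinction the reviewer draws can be illustrated with two query shapes (sketches only; the actual queries in the authors' repository may differ): an aggregate COUNT, which the public endpoint can answer within its time limit, versus a full-extraction query, which must materialise every statement of every matching item and is the kind that times out:

```python
# Counting matches is cheap: the endpoint can scan the wdt:P31 index
# without materialising any item data.
COUNT_QUERY = """
SELECT (COUNT(?item) AS ?c) WHERE {
  ?item wdt:P31 wd:Q7187 .          # instances of gene (Q7187)
}
"""

# Extracting the subset itself is a much heavier workload: every statement
# of every matching item must be retrieved and serialised, which is the
# kind of query that exceeds the public endpoint's timeout.
EXTRACT_QUERY = """
CONSTRUCT { ?item ?p ?o . } WHERE {
  ?item wdt:P31 wd:Q7187 ;
        ?p ?o .
}
"""
```

This is why counts can be reported via the endpoint while the subsets themselves must be extracted from dumps by the surveyed tools.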


Page 3 line 44 “In Section 4, the apper investigates the performance…” I presume you meant “paper”.
Page 5 line 18 “Shape Expressions (ShEx) [22] is a structural schema language allowing validation, traversal and transformation of RDF graphs?” Why does this end in a question mark?
Page 8 line 39 “The extracted subsets, along with [???] can be found on Zenodo” missing a word


Review #2
Anonymous submitted on 04/Jul/2023
Review Comment:

This paper surveys available tools for Wikidata subsetting, namely WDSub, KGTK, WDumper, and WDF.
Often, researchers working on Wikidata require *only* a portion of this web-scale knowledge base.
Obtaining the relevant subset of data is not easy:
(1) by querying Wikidata's SPARQL endpoint: most subsetting-purpose queries will time out;
(2) by downloading the whole Wikidata dump: storing and processing it is a problem, especially on personal computers or when powerful resources aren't available.

To the best of my knowledge, this is the first paper to give an overview of existing Wikidata subsetting tools.

Please find my comments below (not in order of importance):

- Perhaps say a word or two about HDT's periodic compressed Wikidata dumps.
- In my opinion, section 1.2 should come before 1.1
- In section 1.1, highlight the main reasons (bold?)
- Section 2 needs a little bit of structure, e.g., subsections such as data format, querying, basic features, unique features, etc.
- Related to the previous point: potentially important features to add are Wikidata's "NoValue" and "Unknown" values, which are allowed as objects of triples, e.g., for a childless or stateless entity
- Description of table 1 in text: itemize column descriptions
- An interesting additional column for table 1 would be "documentation availability". The usability of these tools highly depends on the availability of comprehensive documentation with a sufficient number of examples. This column could have simple checkmarks as values.
- For the mention of your GitHub repository on page 7, include a URL as a footnote instead of a citation. Also consider adding it to the abstract.
- Table 5, column headers: instead of condition 1, 2, 3…, give them expressive names (even if acronyms of the conditions)
- page 11, line 31, some references about KG repair or inconsistency detection might be relevant (to suggest a couple [1,2]), or maybe a survey on that?
- page 15: perhaps rename schema 1 and 2 to "only referenced instances" and "all instances" respectively
- on evaluating the efficiency of tools, it would be interesting to see the difference in retrieval time of the same tool over subsets of different sizes, e.g., retrieving all people-related statements vs. actor-related statements
- conclusion: the open questions could use some enumeration

- p1, line 32 ", including more"
- p1, line 40 "Results show that"
- p2, line 15 "what those are? why we need them?..."
- p3, line 39 "etc., and properties"
- p4, line 36 "or"
- p5, line 25 ", such as DBpedia"
- spaces between lines on page 14 are odd (probably a LaTeX-related issue)