Review Comment:
This paper surveys available tools for Wikidata subsetting, namely WDSub, KGTK, WDumper, and WDF.
Often, researchers working on Wikidata require *only* a portion of this web-scale knowledge base.
Obtaining the relevant subset of the data is not straightforward:
(1) querying Wikidata's SPARQL endpoint: most subsetting queries will time out;
(2) downloading the full Wikidata dump: storing and processing it is a problem, especially on personal computers or when high-end computing resources are unavailable.
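To make option (1) concrete, a common workaround is to page a large subsetting query with LIMIT/OFFSET so each request stays under the endpoint's timeout. A minimal sketch (the query, page size, and `paged_queries` helper are illustrative, not from the paper):

```python
# Illustrative sketch: paginating a subsetting query against Wikidata's
# public SPARQL endpoint to reduce the risk of timeouts.
# The endpoint URL is real; the query and page size are example choices.

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata endpoint
PAGE_SIZE = 5000  # hypothetical page size; tune to stay under the timeout

QUERY_TEMPLATE = """
SELECT ?person ?p ?o WHERE {{
  ?person wdt:P106 wd:Q33999 .   # occupation: actor
  ?person ?p ?o .
}}
LIMIT {limit} OFFSET {offset}
"""

def paged_queries(n_pages: int, page_size: int = PAGE_SIZE):
    """Yield one SPARQL query string per page of results."""
    for i in range(n_pages):
        yield QUERY_TEMPLATE.format(limit=page_size, offset=i * page_size)
```

Each yielded query can then be sent to the endpoint separately; even so, very large subsets remain slow to retrieve this way, which motivates the dump-based tools the paper surveys.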
To the best of my knowledge, this is the first paper to give an overview of existing Wikidata subsetting tools.
Please find my comments below (not in order of importance):
- Perhaps say a word or two about HDT's periodically published compressed Wikidata dumps: https://www.rdfhdt.org/datasets/
- In my opinion, section 1.2 should come before 1.1
- In section 1.1, highlight the main reasons (bold?)
- Section 2 needs a little bit of structure, e.g., subsections such as data format, querying, basic features, unique features, etc.
- Related to the previous point: potentially important features to add are Wikidata's "NoValue" and "Unknown" values, which are allowed as objects of triples, e.g., for a childless or stateless entity.
- Description of table 1 in text: itemize column descriptions
- An interesting additional column for table 1 would be "documentation availability". The usability of these tools highly depends on the availability of comprehensive documentation with a sufficient number of examples. This column could have simple checkmarks as values.
- Regarding the mention of your GitHub repository on page 7: include the URL as a footnote instead of a citation. Also consider adding it to the abstract.
- Table 5, column headers: instead of "Condition 1", "Condition 2", etc., give them expressive names (even if acronyms of the conditions).
- Page 11, line 31: some references about KG repair or inconsistency detection might be relevant (e.g., [1,2]), or perhaps a survey on the topic.
- Page 15: perhaps rename schemas 1 and 2 to "only referenced instances" and "all instances", respectively.
- On evaluating the efficiency of the tools, it would be interesting to see the difference in retrieval time of the same tool over subsets of different sizes, e.g., retrieving all people-related statements vs. actor-related statements.
- In the conclusion, the open questions could use enumeration.
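On the "NoValue"/"Unknown" comment above: in Wikidata's JSON dumps these appear as the snaktype of a statement's main snak ("novalue" and "somevalue", respectively), which a subsetting tool must handle explicitly. A minimal sketch of the distinction (the `filter_claims` helper and the claim data are toy examples, not from any of the surveyed tools):

```python
# Sketch: Wikidata's JSON dumps mark "no value" and "unknown value"
# statements via mainsnak.snaktype ("novalue" / "somevalue"); a subsetting
# tool must decide whether to keep such statements in the extracted subset.
# filter_claims is a hypothetical helper; the claim below is toy data.

def filter_claims(claims: dict, keep_special: bool = True):
    """Yield (property, snaktype) pairs, optionally dropping special snaks."""
    for prop, statements in claims.items():
        for st in statements:
            kind = st["mainsnak"]["snaktype"]
            if kind == "value" or keep_special:
                yield prop, kind

# Toy entity: a childless person, i.e. P40 (child) with snaktype "novalue"
claims = {"P40": [{"mainsnak": {"snaktype": "novalue", "property": "P40"}}]}
```

Whether a tool preserves, drops, or silently mangles such statements would be a useful row in the feature comparison.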
Typos/minor:
- p1, line 32 ", including more"
- p1, line 40 "Results show that"
- p2, line 15 "what those are? why we need them?..."
- p3, line 39 "etc., and properties"
- p4, line 36 "or"
- p5, line 25 ", such as DBpedia"
- spaces between lines on page 14 are odd (probably latex related issue)
[1] https://suchanek.name/work/publications/eswc-2021.pdf
[2] https://dl.acm.org/doi/abs/10.1145/3366423.3380014