Evidence of Large-Scale Conceptual Disarray in Multi-Level Taxonomies in Wikidata

Tracking #: 3480-4694

Authors: 
Atílio A. Dadalto
João Paulo A. Almeida
Claudenir M. Fonseca
Giancarlo Guizzardi

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract: 
The distinction between types and individuals is key to most conceptual modeling techniques and knowledge representation languages. Despite that, there are a number of situations in which modelers navigate this distinction inadequately, leading to problematic models. We show evidence of a large number of representation mistakes associated with the failure to employ this distinction in the Wikidata knowledge graph, which can be identified with the incorrect use of instantiation, which is a relation between an instance and a type, and specialization (or subtyping), which is a relation between two types. The prevalence of the problems in Wikidata's taxonomies suggests that methodological and computational tools are required to mitigate the issues identified, which occur in many settings when individuals, types, and their metatypes are included in the domain of interest. We conduct a conceptual analysis of entities involved in recurrent erroneous cases identified in this empirical data, and present a tool that supports users in avoiding some of these mistakes.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Jul/2023
Suggestion:
Major Revision
Review Comment:

This submission studies the conceptual disarray in multi-level taxonomies in Wikidata, with a particular focus on two anti-patterns that capture confusion between subclasses and instances. The paper provides a detailed description of Wikidata's taxonomical organization, and analyzes how common the two anti-patterns are. These results are supplemented by an analysis and an initial tool that can be used to capture anti-patterns proactively at knowledge insertion time.

The overall idea of analyzing anti-patterns and integrating them into the Wikidata ecosystem is an attractive one, as there is a recognized confusion in the community with Wikidata's taxonomical organization. The analysis of the paper is interesting and informative, and the tool is a welcome supplement to help editors and knowledge engineers do their job more effectively.

I suggest that the following aspects should be improved to enhance the quality of the paper.

1. Motivation - the authors indicate that similar prior works exist (one of which is by a subset of the authors), which brings up the question about the novelty of this work. The introduction of the paper does not motivate the need for this study sufficiently. Why is this study needed, what gaps in existing research does it fill? Why is it interesting to run a similar experiment as in 2016 now? Some of these questions are touched on in the Related work section, but I wish that the authors summarized these deltas in the introduction as well to help the reader understand the significance of this work.

2. Generalizability/impact - The paper indicates that there are various anti-patterns, two of which are studied in this work. This brings up the questions: why these two anti-patterns; how many other anti-patterns are there; would these investigations generalize to the other anti-patterns and is that relevant to do in the future? In addition, the anti-patterns are not formally defined (the paper refers the reader to other works for this), which also harms the manuscript. I would suggest including a taxonomy of the patterns, with a formal definition at least of the two that are covered in this work, and a crisp motivation for the choice of these two anti-patterns.

3. Insights - The paper has a lot of interesting examples that illustrate the challenge of conceptual modeling of knowledge in Wikidata. However, it would have been good if the authors had organized these examples into high-level insights/findings - this would give the reader clearer takeaways from the paper, rather than merely a series of (perhaps isolated) cases. Moreover, the analysis should IMHO provide more insight into the conceptual disarray - for instance, are the errors typically caused by bots or by manual edits? Do they also correlate with constraint violations in Wikidata?

4. Problems vs solutions - The paper points to many problematic cases, yet there is very little information about what the solutions to these problems (manual, semi-automatic, or automatic) would look like. This is also reflected in the tool, which points out violations of anti-patterns but stops there. Following the discussion of different ways to resolve some of the violations, I was expecting that the tool would provide a mechanism to help editors mitigate these violations more efficiently. Without that, it is questionable how useful the tool would be in practice. In fact, the concluding remarks section states that the methodology and the computational tool help users, but there is no evaluation of whether this is true for users at all, which leaves this claim unsupported.

Other:
* "On the other hand" requires there to be an "On the one hand" clause before
* Footnotes should come after punctuation marks, not before

Review #2
By Masaharu Yoshioka submitted on 13/Jul/2023
Suggestion:
Minor Revision
Review Comment:

This paper analyses the current use of two relation types in Wikidata (instance-of and subclass-of). The authors point out that, from their perspective, there are many incorrect uses of these relations. They also built a system to identify them. The idea is interesting.

Keeping Wikidata consistent remains a controversial matter, since it is maintained by a variety of editors who have different views on how to describe their own knowledge. However, as the authors mention in the paper, WAPA may provide some insight into how their knowledge can be placed in the Wikidata knowledge structure.

My overall evaluation of the revised paper is accept, because the authors updated the paper based on my comments.

Review #3
Anonymous submitted on 08/Sep/2023
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

This paper presents a conceptual analysis of entities involved in recurrent erroneous cases identified with the incorrect use of instantiation, which is a relation between an instance and a type, and specialization, which is a relation between two types. It proposes a tool that supports users in avoiding some of these mistakes.

The authors have responded convincingly and addressed the various questions and suggestions raised in the first round of evaluation:

- better motivating the work in the introduction
- clearly identifying the research questions (Section 3)
- introducing a section better discussing the anti-patterns (Section 3.1)
- clear positioning with respect to the previous work (short paper ER conference)
- extending the analysis of the introduction of the OpenCyc meta-classes
- introducing a new section dedicated to the related work (Section 6)

For these reasons, I consider that the paper is now in shape for publication.