Review Comment:
In my understanding, this paper proposes a methodology that uses a number of Wikidata “mechanisms” to assess whether we can trust the data in Wikidata. The authors use information related to statements (i.e. “A Statement is a piece of data about an item, recorded on the item's page.” (https://www.wikidata.org/wiki/Wikidata:Glossary)). They present descriptive statistics for the properties, including specific qualifiers, statement rank (normal, preferred, deprecated), and references. In the end, the authors argue that these mechanisms do not support trust sufficiently and should be improved.
The subject of this research, data quality, is essential and of great value to the field and to the Wikidata KG. This work proposes a new angle for examining whether we can trust the data, presents an overview of statistics related to properties, and suggests interesting future directions for research. However, I found this paper difficult to follow, and it is unclear to me whether I have understood the full methodology and the results correctly.
General comments:
- The authors may need to consider organising the paper differently. A simpler structure with standard sections, e.g., (1) Introduction, (2) Background, (3) Related work, (4) Data, (5) Methodology, (6) Results, (7) Discussion, and (8) Conclusion and Future work, could improve the narrative of the paper and give a clear understanding of the methodology and results.
- The paper could also benefit from more cohesive paragraphs. In many cases, there are one-sentence paragraphs that belong with the previous or the next paragraph. For example, Section 5.1 Discussion is full of them. These fragmented paragraphs made it hard to follow the narrative of the sections. A useful approach would be, for each section, to write down the topics that the authors want to cover there and then write one paragraph for each topic. Short and simple sentences could also improve understanding.
- It would be more transparent if the authors added more references to support their arguments. For example, on page line 36 about applications, page 2 lines 42-46, etc.
Introduction - Data in Wikidata:
- I would appreciate it if the authors kept this section as “Introduction” and removed “Data in Wikidata”. I believe the authors here aim to explain (1) why the subject of the paper is important, (2) what others have done about it, and (3) what the authors did in this paper. However, it is hard for the reader to follow the narrative. This section could benefit from better paragraph structure, as noted in the general comments, and from some extra information about contributions and implications, so that the reader can understand the value of this work.
- Lines 38-45 could be easier to understand if the authors first explained the terminology of Wikidata. For example, Figure 1 shows an example of an item. The figure could benefit from annotations that highlight the item, claim, statement, property, qualifiers, etc. I suggest including a Background section to describe Wikidata practices, such as the terminology and all the descriptions we find in the section Motivation - Incongruences in Wikidata. After understanding how Wikidata works, the reader can follow the methodology and results.
Motivation - Incongruences in Wikidata:
- This section describes Wikidata. I suggest changing the title to “Background” and gathering there all the Wikidata practices that are currently described in multiple places in the paper (constraints, qualifiers, the property “disputed by”, etc.). Clear examples for every case and highlighted Wikidata screenshots could improve understanding.
- This section mixes Wikidata descriptions and methodology. It would be clearer if the authors, for example, described what constraints are in Section Background, and how and why they use constraints in Section Methodology.
- The authors present three types of situations to investigate trust decisions: incompleteness, incongruences, and controversies. In my understanding, these three are related to qualifiers, but this is not clear in the manuscript. This information may need to be described in Section Methodology.
The trust process using WD:
- This section starts to explain the methodology. It describes the information used for the methodology: qualifiers, statement rank, and references. It would be clearer if the authors explained more about their methodology after the terminology on page 4 line 8. It could be useful to connect the qualifiers here with the completeness, incongruences, and controversies.
- The authors mention here that they call this methodology “TrustLayer”. However, in the Abstract and Future work, they call it “KG profiling”. I would suggest keeping one name, introduced in the Abstract and Introduction.
- I would move the description of the property “disputed by” to the section Background.
- It would be very helpful to have examples for the terminology on page 4 lines 4-8.
WD support for the trust process:
The descriptive statistics presented in this section are very informative and provide useful information related to property characteristics. However, it is not clear to me whether these are the results or an exploratory analysis. Assuming that these are the results, here are my suggestions:
- The authors start by describing the data. It would be clearer if they added a separate section about Data. Section Data could describe what data the authors use and the descriptive statistics of pages 4-6 about the examined properties (predicate, qualifications and qualifies), property constraints (could the authors explain what “none” means in Table 1?), property qualifiers, and claims. The rest of the data in this section could perhaps form the Section “Results”. The subsections 4.1, 4.2, 4.3, 4.4, 4.5 could be the subsections of “Results”. I found it nice that they are split based on the five examined characteristics.
- Footnote 4 states that some data are missing from the initial dataset. I think this is an important piece of information that should be included in the main manuscript. The authors could give more detail about what is missing and why, so that the reader already knows this when the authors mention it again on page 5 line 7.
- I would appreciate it if the authors could include labels for the y-axis in the Figures.
- It would be nice if the authors could explain the difference between “Frequency” and “Accumulated Frequency” in the Tables.
Conclusions:
- It would be clearer if the authors turned the subsections here into separate sections. After Section Results, the Related work could follow, then the Discussion, and finally Conclusions and Future work.
- In the subsection Discussion, I expected to get a better understanding of the paper’s results and the authors’ arguments. However, it is still not clear to me how the results evaluate the data in Wikidata. In my opinion, a clear narrative in the Introduction and a methodology plan could really improve the understanding of the paper.
- In the end, the authors argue that “WD’s support for trust decisions about its statements is low and could be improved”. It would be interesting to read here if the authors have any thoughts related to improvements.
Long-term stable URL for resources:
- The authors use a GitHub repository to share their data and scripts.
- The repository includes a README.md file with descriptions regarding the data.