Review Comment:
The paper reports experiments run over the COVID19-related content of Wikidata, in order to identify data quality issues (i.e. incorrect or missing triples).
The main contribution consists of the experiments reported in Sections 5 and 6.
Section 5 identifies data quality issues on a semantic basis, leveraging the so-called "statements" of Wikidata.
A variety of heuristics are applied (leveraging for instance frequent classes of subjects/objects of a given property, or inverse property statements) in order to identify either triples that are likely to be incorrect, or possibly missing information.
In contrast, Section 6 identifies numerical assertions about the pandemic (raw statistics, rates, etc) that are likely to be incorrect, based on external epidemiological knowledge.
In both cases, the missing and/or possibly incorrect triples are identified via SPARQL queries designed for the occasion and executed over the Wikidata endpoint.
I am not an expert, but I suspect these experiments may be interesting to the SW community, since large collaborative knowledge graphs like Wikidata are notorious for their inconsistencies.
I found Section 6 particularly interesting, in the way additional knowledge (about the pandemic) is exploited to identify data quality issues.
On the other hand, I found the introduction of these experiments quite confusing.
A large portion of what precedes (i.e. up to page 11) is only loosely related.
In other words, a significant amount of contextual information (sometimes with exhaustive tables) is provided but never used in the paper.
In particular:
- about COVID 19,
- about Wikidata: list of prefixes, moderation system, list of buttons of the SPARQL GUI interface, etc.
- about SPARQL: list of clauses and aggregate functions, syntax of filtering conditions, etc.
- about ShEx: syntax, format of identifiers, etc. (unless I missed it, no actual use of ShEx was made in the reported experiments)
- about validation techniques that could in theory be implemented, but fall outside the scope of the paper.
At times, it seems like the article was designed as a generic introduction to a series of semantic web standards/resources.
I doubt this is needed, especially in the SWJ.
For instance, there are already good introductions to SPARQL and ShEx available elsewhere.
This has the effect of diluting the contribution, and requires the reader to manually filter out what is unnecessary.
I would suggest retaining only what is useful (for instance, in Table 1, list only the Wikidata constraints that have been exploited), and moving the rest to the appendix.
I also feel that some simple definitions are missing in order to understand the experiments reported in Section 5 (Section 6, in comparison, is very clear).
One needs to go through the (useful) examples provided later on in order to reconstruct what "use case", "supported statements", "scheme" or "logical constraints" mean for instance.
And even then, much is left implicit.
As a consequence, it is sometimes difficult to interpret the results.
For instance, from the examples that are provided, I suspect that Task T2 identifies missing statements of the form:
p_1 "inverse property" p_2
on a statistical basis, whereas task T3 identifies missing statements of the form:
o p_1 s,
when both:
s p_2 o,
and:
p_1 "inverse property" p_2
are present in the knowledge graph.
But this is only a blind guess, because it is not made explicit in the paper.
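For concreteness, my reading of T3 corresponds to something like the following query. This is purely illustrative, not the authors' actual query; it assumes the Wikidata "inverse property" declaration (P1696) and the standard wikibase:directClaim mapping from property entities to direct-claim predicates:

```sparql
# Sketch of my reading of Task T3 (not the authors' query):
# find statements ?s ?p1 ?o whose property has a declared inverse ?p2,
# but where the inverse statement ?o ?p2 ?s is absent.
SELECT ?s ?p1 ?o WHERE {
  ?prop1 wdt:P1696 ?prop2 ;            # P1696 = "inverse property"
         wikibase:directClaim ?p1 .
  ?prop2 wikibase:directClaim ?p2 .
  ?s ?p1 ?o .
  FILTER NOT EXISTS { ?o ?p2 ?s }
}
```

Spelling out the intended query in this way (even informally) would remove most of the ambiguity.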
Overall, the ambiguity could be significantly reduced by adopting a slightly more formal notation, and adding a short section with preliminary definitions.
Similarly, since the approach is based on so-called "logical" constraints, it would be useful to clarify (even informally) which semantics the authors adopted.
Among others:
- the article refers to "statements" of the form (C_S, P, C_O), where C_S, and C_O are classes, and P is a property.
But it is unclear what these statements are: they do not seem to appear in Wikidata, nor in the Semantic Web standards (RDF/RDFS/OWL) that apparently inspired the Wikidata "statements" (unless P is a meta-property like "subClassOf").
It is also unclear how the authors interpret such statements (e.g. it could be a combination of domain + range, or instead OWL-like qualified range restrictions, etc.).
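These candidate readings differ substantially; in description-logic notation (my own formalization, not the authors'), the two interpretations would be:

```latex
% Domain-and-range reading: every subject of P is a C_S, every object a C_O:
\exists P.\top \sqsubseteq C_S
\qquad
\top \sqsubseteq \forall P.C_O
% Qualified (OWL-like) reading: only subjects of class C_S are constrained,
% and only their P-objects must be C_O:
C_S \sqsubseteq \forall P.C_O
```

Stating which of these (or another) interpretation is intended would make the experiments much easier to interpret.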
- "inverse property" statements are treated as constraints, which departs from their traditional interpretation in Semantic Web standards (also surprisingly, "equivalent property" statements did not get the same treatment).
- the assumption is implicitly made (though not consistently) that an item has at most one class (or maybe several, via the transitive closure of "subClassOf"?).
All these choices are interesting.
I have no doubt that they are well motivated, and they may be a meaningful contribution on their own.
But because some of them are unconventional (at least according to SW standards), it would be useful to make them explicit.
### Suggestions
- Page 1:
The first paragraph mostly consists of unnecessary information.
For instance "zoonotic", "characterized by the onset of acute pneumonia and respiratory distress", "frequently compared to the 1918 Spanish Flu", "distribution and storage challenges".
I guess this can be safely dropped.
- Page 3, second paragraph ("knowledge graph evaluation is therefore necessary"):
Most of this paragraph can be moved to the state of the art section.
- Page 3, last paragraph ("The data structure employed by Wikidata"):
This is very useful.
But I would suggest moving it to Section 3 (next to Figure 2), in order to have everything about the Wikidata data structure in one place.
- Page 4, second column, second ("As a collaborative venture") and third ("Much as Wikipedia ...") paragraphs:
The information provided in these two paragraphs is arguably not needed in this paper.
I would recommend dropping them.
- Section 3:
The structure of this section is confusing:
. The first three paragraphs belong to the state of the art.
. Then comes the description of the Wikidata data structure.
. The end of the section introduces ShEx.
I think readability could be improved by reorganizing this content.
For instance:
. Have a clear state of the art section (combining content from Sections 2, 3 and 7), possibly split into subsections.
. Have a section/subsection that exclusively describes the Wikidata data structure (combining content from Sections 2 and 3). This is important in order to understand the verification techniques proposed in Sections 5 and 6.
. Drop the introduction to ShEx (not used anywhere in the paper).
- Page 5, Figure 1:
I do not see a reason for including this figure in the paper (this is the workflow of a completely different approach to data quality), nor for the link to the source code.
- Page 5, "non-relational statements cannot have a Wikidata item as an object" and "objects of relational statements are not allowed to have data types like a value or a URL":
I think I roughly understand what this means ("non-relational" means a statement whose object should be an RDF literal).
But for clarity, it would be useful to define explicitly what "relational" and "non-relational" means (again, in a short section with preliminary definitions).
In particular, it is unclear to me whether a "relational" statement requires an IRI as object, or more specifically a Wikidata item.
- Page 6, second column, second paragraph ("These statements can be interesting"):
The description is vague, and it is unclear whether the techniques that are mentioned in this paragraph have been implemented, either by the authors or by someone else.
If this is part of the state of the art, then make it explicit.
If this is instead an announcement of the results presented in Section 5 (I suspect this is the case for the last sentences, about inverse statements), then make it explicit.
If it is neither of the two, then this paragraph can safely be dropped.
- Table 1:
No need for an exhaustive list here, only the constraints used in Sections 5 and 6, if any (the rest can be moved to the appendix).
- Page 8:
I do not see the point of providing the syntax and semantics of ShEx, since it is not used anywhere in the paper.
I do not see the value of Figure 4 in this paper either.
- Section 4:
Similarly to the presentation of ShEx, the purpose of this section is unclear to me.
A list of SPARQL operators is provided, together with their (informal) semantics.
Syntactic details are even added, for instance:
"the variables in SPARQL are preceded by an interrogation mark and are not separated by a comma".
But this information is not exploited in the paper (aside from the SPARQL queries provided in appendix).
As far as I understand, it would be sufficient to:
- say that SPARQL is a SQL-like query language designed for RDF graphs,
- mention that it also allows retrieving higher-order entities (such as classes and properties),
- provide a link to the SPARQL specification (and possibly to a good introduction).
Similarly, details about the GUI of the Wikidata SPARQL endpoint and the Wikidata prefixes would be relevant if the paper were a tutorial about querying Wikidata with SPARQL.
But this is clearly not the case.
So I think these details can safely be dropped (together with Figures 5 and 6).
- Page 11:
"a similar protocol fully based on logical constraints fully implementable using SPARQL queries to infer constraints for the assessment of the usage of relation types (P) on Wikidata based on the most frequently used corresponding inverse statements (C_O , P^-1 , C_S ).":
This is very difficult to parse.
Also the repetition of "constraints" does not help (it is unclear whether both uses refer to the same thing).
Again, this could be easily fixed by adopting a slightly more formal notation.
- Page 13, "logically accurate":
Maybe "semantically accurate"?
Or just "correct"?
- Section 6:
A small table with the meaning of the different codes (c, l, r, m, mn, mx, R_0) would improve readability.
- Page 20, second column, and page 21, first column:
Unless I missed something, these two columns essentially say that statements of types R_0, mn and mx cannot be reliably validated or invalidated.
If so, I wonder if it is worth mentioning them in the paper at all.
- Section 7:
The state of the art goes into many directions.
I suggest dropping the generic considerations about machine learning, IoT, XML, etc.
It feels like an enumeration of loosely connected topics, which is detrimental to the argument in my opinion.
Instead, the section could focus on a comparison with alternative approaches to data quality assessment in knowledge graphs (some of which have been introduced in Sections 2 and 3).
- Page 23, "the function of logical conditions should be expanded to refine the list of pairs (lexical information, semantic relation) to more accurately identify deficient and missing semantic relations":
This is quite abstract, it could either be made more precise or dropped.
- Page 23, "Big data is the set of real-time statistical and textual ...":
Not sure that generic considerations about big data are needed here (arguably out of scope).
- Page 23, word embeddings, latent dirichlet analysis, Hadoop and MapReduce:
Again, not much to do with the content of the paper, can safely be dropped.
### Remarks
- Page 3:
The phrasing of the contributions can be misleading.
. "We introduce the value of Wikidata ... ": does "introduce the value" mean "argue in favor of using"?
. "we cover the use of SPARQL to query this knowledge graph (Section 4)": "cover the use" suggests some state of the art.
Instead, Section 4 consists of a brief description of the SPARQL language, and some information about the Wikidata endpoint (and prefixes).
. "we demonstrate how logical constraints can be captured in structural schemas": "logical constraints" can be misleading (there is no logic here) as well as "demonstrate" (no theoretical result). Maybe something like "we empirically illustrate how semantic data quality issues can be identified via SPARQL queries"?
. "used to [..] encourage the consistent usage of ...": I guess this refers to some form of automated suggestion, but I did not see anything directly related in Section 5.
- Page 6, "to its corresponding Wikidata item":
This is too vague.
Let "p" be a property.
If I understand correctly, in a statement of the form:
p "subject item" c
the object "c" is (one of) the expected class(es) of "s" in statements of the form:
s p o
Again, a slightly more formal notation would really help.
- Page 6, "missing Wikidata statements (C_1 , P, C_2), which are implied by the presence of inverse statements (C_2, P^-1, C_1) in other Wikidata resources.":
This is arguably confusing.
In Figure 2, there is no statement of the form (C_1 , P, C_2) where C_1 and C_2 are classes.
Maybe what is meant here is statements of the form:
s p o
and
o q s
where q is declared as the inverse of p.
If this is the case, then it should also be made explicit that "inverse property" statements are understood as constraints, i.e. if two statements:
s p o
and
p inverseOf q
are present, then
o q s
should also be present.
This is fine, but needs to be discussed a little bit, because it significantly departs from the usual interpretation of "owl:inverseOf" statements, and more generally from the design of RDFS and OWL.
Traditionally, the third statement would not be considered as missing, but omitted on purpose, because it can be inferred from the two others, which makes it redundant.
In other words, by design, RDFS/OWL statements (which are very similar to the Wikidata "statements") are meant to derive additional information, not to identify missing data.
It is also unclear to me why "equivalent property" statements were not used in Section 5 similarly to "inverse property" statements to identify missing triples.
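To make the difference between the two readings concrete, here is a toy illustration (my own sketch, not the authors' code) over a tiny triple set:

```python
# Toy triple set: two p-statements, but only one has its inverse q-statement.
triples = {("s1", "p", "o1"), ("s2", "p", "o2"), ("o1", "q", "s1")}
inverse_of = {("p", "q")}  # "p inverseOf q" declarations

# Constraint reading (as the paper seems to use it): every (s, p, o)
# without a matching (o, q, s) is flagged as a MISSING statement.
missing = {(o, q, s)
           for (p, q) in inverse_of
           for (s, pp, o) in triples
           if pp == p and (o, q, s) not in triples}
print(missing)  # -> {('o2', 'q', 's2')}

# Inference reading (RDFS/OWL style): the same triples are not missing
# but derivable; the graph is simply completed by entailment.
entailed = triples | missing
```

Under the OWL reading, the absence of ('o2', 'q', 's2') is not an error at all, which is exactly why the constraint interpretation adopted here needs to be stated explicitly.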
- Page 6, "and several description logics for the usage of the property":
There is no description logic here.
- Page 11, "inverse statements (C_O , P^-1 , C_S)":
Same remark as above, these do not seem to be Wikidata statements.
This is particularly confusing here, because this suggests that classes are associated to a property (or its inverse), whereas Figure 7 suggests that classes are associated to the subject and object of a statement.
This needs to be clarified.
- Page 11, "of P(S,O)":
First time that this prefix notation is used in the paper.
Again, a section with preliminaries would help.
- Table 3, "P:(C_S ,C_O) pairs":
Yet another notation.
Unclear how it relates to the "(C_S , P , C_O) statements" mentioned earlier.
- Table 3, "corresponding to each common use case":
This is unclear.
Which "use cases" does this refer to?
Also the footnote does not help ("A set of conditions" is too vague).
- Table 3, "corresponding to the most common (C_S , P^-1 , C_O )":
"Corresponding" is unclear here (define it formally).
- Page 12 "the effectiveness of the use of logical constraints to generate conditions for the verification and validation of the use of relation types":
Difficult to parse.
It is also unclear what "conditions" means (again, a more formal description would help).
- Page 12, "we used logical constraints" and "the use of logical constraints":
It is unclear what "logical constraints" refers to.
Are these constraints expressed in some logic?
- Page 13:
Again, define "use cases" (are these Wikidata triples, or something else?)
- Page 16:
Clarify what "False positive" and "True positive" mean in this context.
- Section 6:
It seems like what is called "statement" in this section is called "relation" in the previous sections.
- Page 21:
the text refers to Tasks M1, M2 and M3 as if they were previously introduced, but I could not see where (the only other mentions seem to be in the appendix).
- Page 22, "These tasks successfully address most of the competency questions, particularly conceptual orientation (clarity), coherence (consistency), strength (precision) and full coverage (completeness)":
It is unclear what "the competency questions" refer to here.
Also this is arguably a bold claim.
The verification techniques that have been implemented are interesting, but I doubt they cover most COVID-19 related quality issues in Wikidata.
- Page 22:
"rule-based" can be misleading here.
The term usually refers to some form of automated reasoning (typically deductive).
But there is no reasoning involved in what is described in Sections 5 and 6 (only query evaluation).
- Page 22:
"software tools" is unclear.
I guess what is meant here is "reasoners" (whose execution can indeed be costly).
- Page 22, "depends on the requirements and capacities of the host computer".
So does SPARQL query evaluation (the triple store is hosted somewhere).
### Questions
- Page 3, "allowing embedding":
What does "embedding" mean here?
- Page 3, "fast-updating dynamic data reuse":
I do not understand what this means. Is there maybe a typo?
- Page 5, "for the reformulation of a query":
Why "reformulation" and not "generation" for instance?
- Page 11, "These constraints can be later used to define COVID-19-related Wikidata statements":
"Define" is unclear.
Does this mean "write"?
Or is this an automated procedure?
- Page 11, "disease is the subject class (C_S) and medication is the object class (C_O)":
This implies that subject and object both have a class, and that each has only one class.
Is this really the case?
- Page 13, "72 percent or more of the supported statements":
What does "supported" mean here?
- Page 13, "the medical logic being entered":
Which logic, and entered by whom?
- Page 14, "successfully sorted":
Sorted by what (number of occurrences)?
- Page 14, "three relations had clear inverse properties":
If I understand correctly, this means that the "inverse property" statement for these was not present in Wikidata, but could be inferred to hold statistically?
Or does this mean something else?
- Page 23, "should not only be restricted to rule-based evaluation but also to lexical-based evaluation":
This is confusing.
Does this mean "restricted further", or instead "expanded to"?
### Typos
- Page 4:
"their nature" -> "the nature"
- Page 4:
"encyclopedia, Wikipedia" -> "encyclopedia Wikipedia"
- Page 4,"The system, therefore, aims":
remove "therefore" (does not follow from the previous sentence).
- Page 4:
"property suggesting system" -> "property suggestion system"
- Page 5:
"(Red in Fig. 6)" -> "(Red in Fig. 2)"
- Page 5, "Another option of validating biomedical statements ...":
split the sentence in two (convoluted).
- Page 6, second column:
I guess "C_S" and "C_O" stand for "C_1" and "C_2" (or conversely).
- Page 6, "equivalents in other IRIs" -> "equivalent IRIs".
Or better, "equivalent properties" (since "inverse properties" is used in the same sentence).
- Page 6:
"some of erroneous use" -> "erroneous uses"
- Page 7, "As shown in Fig. 2, a property constraint is defined as a relation where the property type is featured as an object":
It seems like in Fig. 2, what is called a "type" constraint is a constraint on the subject, not the object.
- Page 12, "in order to omit statements that are not widely used in Wikidata":
Should "statement" be "property" here?
- Table 9, caption:
The beginning is copy-pasted from the caption of Table 8, it should not be there.
- Page 23:
"its natural language information of a knowledge graph item" -> "its natural language information"