Review Comment:
The article introduces the concept of “weaker logical status” claims –
claims that are neither true nor false, but rather true in some
context, such as claims that are based on uncertain information or on
competing hypotheses, or claims that have changed over time. It
identifies several methods of representing such information on
Wikidata and investigates their usage on two subgraphs of Wikidata,
one containing cultural heritage data and the other a comparably
sized sample of astronomical data. Lastly, the authors suggest several
improvements to the current state.
The article appears to be the first such study, and, indeed, one of
the first to introduce the concept of a “weaker logical
status”. However, in its current state, I cannot recommend the
article for acceptance.
The article identifies four approaches used to represent “weaker
logical status” claims in Wikidata. The first such approach is a
distinction between “asserted” and “non-asserted” statements. Out of
the four approaches, this is the only one not defined on the Wikibase
data model, but rather on the RDF representation: a statement is
considered “asserted” if it has a corresponding “truthy” triple
[1]. However, whether or not such a triple is included in the RDF
representation is determined solely by the statement's rank and the
absence of higher-ranked statements – it is not a conscious choice
made by editors. As ranking is discussed as the second approach, this
feels somewhat redundant, in particular since every deprecated
statement is also automatically non-asserted. I would instead suggest
distinguishing between deprecated and non-deprecated non-best-rank
statements (i.e., normal-ranked statements where a preferred statement
exists), as is, indeed, done in Figure 2.
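The suggested three-way distinction can be read directly off the
Wikibase entity JSON, without reference to the RDF export. A minimal
sketch (field names follow the Wikibase JSON dump format; the function
name is illustrative):

```python
def classify_statements(entity):
    """Classify each statement of a Wikibase entity JSON dict as
    'deprecated', 'non-best-rank' (normal rank shadowed by a preferred
    statement on the same property), or 'best-rank'."""
    result = []
    for prop, statements in entity.get("claims", {}).items():
        has_preferred = any(s["rank"] == "preferred" for s in statements)
        for s in statements:
            if s["rank"] == "deprecated":
                label = "deprecated"
            elif s["rank"] == "normal" and has_preferred:
                label = "non-best-rank"
            else:
                label = "best-rank"
            result.append((prop, s["id"], label))
    return result
```

Counting the “non-best-rank” label then captures exactly the
statements that are non-asserted without being deprecated.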
The third approach investigated is qualified statements. The authors
mention that, e.g., P2241 does not have a list of recommended terms on
its discussion page – while true, it should be noted that allowed
values for P2241 qualifiers are still restricted via the “value type”
property constraint, and a list can be retrieved, e.g., via a SPARQL
query [2]. Furthermore, a “one-of” constraint also exists that, while
deprecated, is still used by the UI to provide suggestions when
editing.
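For illustration, a “one-of” constraint's allowed values can be
retrieved with a query along the following lines (P2302 is “property
constraint”, Q21510859 “one-of constraint”, P2305 “item of property
constraint”; the prefixes are those predefined by the Wikidata query
service, and the function name is mine):

```python
from textwrap import dedent

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def one_of_values_query(prop_id):
    """Build a SPARQL query listing the values allowed by the 'one-of'
    constraint (Q21510859) declared on a property such as P2241."""
    return dedent(f"""\
        SELECT ?value ?valueLabel WHERE {{
          wd:{prop_id} p:P2302 ?constraint .
          ?constraint ps:P2302 wd:Q21510859 ;
                      pq:P2305 ?value .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}""")

# The query can then be sent to WDQS, e.g. with
# requests.get(WDQS_ENDPOINT,
#              params={"query": one_of_values_query("P2241"),
#                      "format": "json"})
```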
The last approach discussed considers “null values”. From the
description, I take it to mean the “unknown value” (“somevalue”) of
the Wikibase data model (since these correspond to blank nodes in the
RDF representation) [3]. However, looking at the code, it seems the
analysis rather counts “no value” (“novalue”) statements [4]. While
both could be labeled “null values”, they are not interchangeable:
“unknown value” signifies that a value exists, but it is not known,
whereas “no value” asserts the absence of a value – indeed, this is
claimed as a problem (“For instance, null values are used in some
predicates to represent values that cannot exist, e.g. when signaling
the start (P155: follows + null value) […] in sequences”, “The
subtlety in the semantic differences between providing no value and
providing a null value for a property of a wikidata item, as well as
their other types of applications makes the use of null values
particularly complicated and ambiguous.”). Clearly, the two kinds of
“null values” need to be distinguished in the analysis; in any case,
the description should match what is actually being counted.
As the count of “weaker logical status” claims in Wikidata seems to be
lower than what the authors expected, they compare the relative
proportions in the cultural heritage datasets with those of the RKD
images collection, where the proportion of “weaker logical status”
data is significantly higher. Since neither of the datasets is contained in
the other (the authors state that about 30000 images from the RKD set
are also present in Wikidata), I wonder whether this might be due to
peculiarities in the datasets (it is also not entirely clear to me
what the RKD dataset contains) – for example, it could simply be that
the RKD dataset contains proportionally more works with, e.g., unknown
creation dates because the proportion of more recent works in Wikidata
is higher (with the assumption being that it's more likely for recent
works to have a known time of creation).
I appreciate that the code used to perform the analysis and the
datasets are available. However, many of the scripts have hardcoded
paths to the data files in them (and thus cannot be run easily without
modifications) – I would suggest always using relative paths (and
documenting what those are). I also suggest providing a pip lockfile
(“requirements.txt”), or at least documenting the required versions of
dependencies used.
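Resolving data files relative to the script's own location is usually
enough to make such scripts portable. A minimal sketch (the “data”
directory name is illustrative, not the repository's actual layout):

```python
from pathlib import Path

# Resolve data files relative to this script's location, so the
# analysis runs regardless of the current working directory.
# ("data" is a placeholder for wherever the dumps actually live.)
DATA_DIR = Path(__file__).resolve().parent / "data"

def data_file(name):
    """Return the path of a data file inside DATA_DIR."""
    return DATA_DIR / name
```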
Some minor comments:
p3, l3: of the named Knowledge Graphs, only Wikidata seems to be a
“collaborative public platform[s]”.
p3, l16: I am not too familiar with the CIDOC CRM, but how is
“crmn:E13_AttributeAssignment” an n-ary relation? It is my
understanding that it is a class used for reified n-ary relations.
p3, l27: “3% of the total of its visual
artworks”: 3% of the visual artworks in Wikidata, or 3% of the visual
artworks in the RKD?
p3, l36ff: “[18]” is about representing the Wikibase data model in RDF
– while this requires some form of reification, this is independent of
the “logical status” of the claims.
p5, l12: “Statements, independently of rank, […]”: claims can be
decorated; a statement encompasses a claim, references, a rank, and
qualifiers. Furthermore, there are many more qualifiers beyond the
four mentioned [5].
p5, l40: “statements can be associated with a blank node”: the blank
node is an implementation detail of the RDF dump export – indeed the
SPARQL endpoint uses skolemised nodes instead [6]. In terms of the
data model, the special value “unknown value” is used (which is not to
be confused with the other special value “no value”).
p6, l12: “follows + null value”: indeed, this is the other kind of
special value, “no value” – note that, in the RDF representation, this
does not correspond to any node, rather, the statement node becomes
the subject of an “rdf:type wdno:P…” triple.
p7, l31ff: I would suggest moving this paragraph in front of the
description of the individual datasets, as it explains how the numbers
of JSON files come to be (also, since the numbers of entities are
usually not divisible by 50, “exactly 50” should likely be “at most
50”).
p8, l21: “avoid assessing a claim”: how does deprecating a statement
avoid an assessment?
p8, l26: “A datasets” looks confusing. Maybe call the datasets “ANs”
and “ANg”, and refer to both as the “AN datasets”?
p10, l15: “Q5727902: circa qualifier”: circa is the value given to
the qualifier, not the qualifier itself.
p10, l42: “This is probably the only true WLS use of null-valued
statements”: ironically, this is a modelling error, since a novalue is
used, although it should be a somevalue.
p10, l51: “shifted from the public domain to copyrighted”: rather
shifted from copyrighted into the public domain?
p11, l23ff: what is the distinction between uncertainty qualifiers
(such as “disputed”) and cautioning qualifiers (“attribution”)? This
seems arbitrary to me.
p13, l6f: “if an artwork A was supposedly moved […]”: (i) “artwork A”
is particularly confusing with “A” also being used for the “A
dataset”; (ii) why must both statements be ranked as deprecated? Since
Wikidata is not a primary source, but rather a secondary database,
unless the claim that the artwork was moved is stated elsewhere, the
claim should not be in Wikidata at all [0]. If there were such a
source, I would expect a normal-rank statement, possibly with a
“disputed” qualifier value?
p13, l32: “Provide a list of suggested values for P2241 and P7452”:
Such lists do exist, although not directly on the discussion page, but
in the form of “one-of” property constraints.
p13, l39f: “distinguish […] between […] WLS […] and non-WLS uses”:
This distinction is already present in the form of the
somevalue/novalue special values [3].
p14, l5f: “can be accessed in the Github folder of the project”:
where? I have not been able to locate it.
While the structure of the article is easy to follow, it contains a
number of spelling mistakes and inconsistencies, and some sentences
are hard to follow. A (likely incomplete) list follows; in particular,
the authors should standardise on British vs. American spelling (e.g.,
both “analysed” and “analyzed” are used throughout the article), on
whether or not to use the Oxford comma, and consistently capitalise
“Wikidata”.
p1, l41: “coming by different and disagreeing sources” ~> “coming from different and disagreeing sources”
p2, l1: “enunciates” ~> “statements”
p2, l11: “limits” ~> “limit”
p2, l43: “(2)” ~> “section 2”
p2, l46 “reserach objective” ~> “research objectives”
p3, l12: “manage” ~> “manages”
p3, l13: “models[5]” ~> “models [5]”
p3, l14: “CRM[4]” ~> “CRM [4]”
p3, l17: “Europeana [10],” ~> “Europeana [10]”
p3, l18: EDM needs to be defined
p3, l30: “data model” ~> “data models”
p3, l31: “(dumping)[14]” ~> “(dumping) [14]”
p3, l40: “according multiple” ~> “according to multiple”
p3, l40: “[19] survey” ~> “Piscopo and Simperl [19] survey”
p3, l41: “categorizes” ~> “categorize”
p3, l51: “See also the list […] can be found at” ~> “The list […] can be found at”
p4, l23: “vy” ~> “by”
p5, l29: “P1502” ~> “P5102”
p6, l35: “WLG” ~> “WLS”?
p7, l8ff: “json” ~> “JSON”
p8, l18: “Av” ~> “As”?
p8, l25: “q:P2241” ~> “P2241 qualifier”
p8, l44: “seem” ~> “seems”
p9, l38: “P155:follows” and “P155:followed by” ~> “P155: follows” and “P156: followed by”
p9, l38: “by In” ~> “by. In”
p9, l40: “alternative” ~> “alternatives”
p9, l42: “asserted)”: there is no opening parenthesis
p9, l42ff: three colons in a row, please rewrite this sentence.
p9, l45: “Ag)”: there is no opening parenthesis
p10, l10: “by,P162” ~> “by, P162”
p11, l36: “seems doing” ~> “seems to be doing”
p11, l43: “a particular attention” ~> “particular attention”
p11, l42: “e.g. authorship” ~> “e.g., authorship”
p13, l20: “WLG” again
p14, l15: “would assigned” ~> “would be assigned”
p14, l41: “a overabundance” ~> “an overabundance”
p14, l42: “seem to a large” ~> “seems to be a large”
p15, l4: “loose” ~> “lose”
[0] https://www.wikidata.org/wiki/Help:Statements#Add_only_verifiable_inform...
[1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_...
[2] https://w.wiki/6Tpt
[3] https://www.wikidata.org/wiki/Help:Statements#Values
[4] https://github.com/alessiodipasquale/Wikidata_WLS/blob/main/countBlank.p...
[5] https://w.wiki/6TrP
[6] https://www.mediawiki.org/wiki/Wikidata_Query_Service/Blank_Node_Skolemi...