Review Comment:
This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: Linked Dataset Descriptions - short papers (typically up to 10 pages) containing a concise description of a Linked Dataset. The paper shall describe in concise and clear terms key characteristics of the dataset as a guide to its usage for various (possibly unforeseen) purposes. In particular, such a paper shall typically give information, amongst others, on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.; metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth; examples and critical discussion of typical knowledge modeling patterns used; known shortcomings of the dataset. Papers will be evaluated along the following dimensions: (1) Quality and stability of the dataset - evidence must be provided. (2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided. (3) Clarity and completeness of the descriptions. Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people. We strongly encourage authors of dataset description papers to provide details about the used vocabularies, ideally using the 5-star rating provided here.
Review for
“Wikidata through the Eyes of DBpedia”
Ali Ismayilov, Dimitris Kontokostas, Soeren Auer, Jens Lehmann, Sebastian Hellmann
I pointed out when I was asked to do the review that I might be regarded as (and possibly also actually be) biased in reviewing this paper, due to my relation with Wikidata. I only agreed because my review would be non-anonymous and published, and thus a possible conflict of interest would be obvious and the review open to public scrutiny. - Denny Vrandecic
The paper describes the work being done in order to add a Wikidata dump to the DBpedia datasets, and thus to provide the whole universe of Wikimedia-related projects in a single, downloadable source, using a single ontology.
In general, the paper does not really address how the two datasets are complementary, and I don’t have the feeling that the discussion of the advantages and disadvantages of the two datasets with regard to each other is sufficient and fair. I had the feeling the paper mostly dwells one-sidedly on the advantages of DBpedia. A few major advantages of Wikidata that I would expect to see mentioned:
A) If you find an error or omission in Wikidata, you can actually go and fix it instantaneously. There is no such mechanism for DBw.
B) For example, the death of David Bowie - more than half a year ago - is still not in the DBw dataset (as per July 12, 2016, as checked on http://wikidata.dbpedia.org/page/Q5383), but the death was updated in Wikidata within minutes. The paper does not make any mention of the fact that, through the way it is designed, there is an inherent and possibly large delay in the freshness of the data.
C) Human-readable IRIs are presented solely as an advantage, without discussing issues of anglo-centricity or the point that the relevant W3C documents suggest using opaque URIs.
D) There is no mention of the licensing of the DBw dataset, which seems more restrictive than the licensing of Wikidata, if I understand the footnote on the page of Bowie correctly (it says the data is published under CC-BY-SA, whereas Wikidata uses CC0. The VOID file at http://wikidata.dbpedia.org/downloads/void.ttl does not state anything about the license).
I recommend a major revision, because I would really like to see a more fair and balanced comparison of the two datasets. I would like to be able to point to this paper when I get asked for a discussion of the relative merits of the two datasets, and currently I would not feel comfortable with doing that.
Most of the following points are details that are easy to fix. I am happy to have a public conversation on any of the points raised - I expect the authors to defend a few of the points I call out. But most of the following points are rather obvious and should simply be fixed.
Section 1:
1) “Wikidata … developed its own data model” - I do not understand what you mean by “its own data model”, as the data is being exported and provided in RDF. What is the sense of the term “data model” such that Wikidata has its own, but DBpedia does not? Isn’t it both just “RDF vocabularies” or “OWL ontologies”?
2) “The multilingual DBpedia ontology, organizes the …” remove comma
3) Structure: You claim that the DBpedia ontology organizes the data, while Wikidata is schemaless. I am not sure I understand the difference. Both DBpedia and Wikidata use an RDF vocabulary / OWL ontology. Why is the one schemaless, whereas the other organizing?
4) Curation: “Wikipedia authors thus unconsciously also curate the DBpedia knowledge base” - I sure don’t hope so ;) - remove “unconsciously”, add “as a side effect”.
5) Curation: don’t split it “Medi-aWiki”, but rather “Media-Wiki” (or not at all)
6) Curation: the point omits the fact that Wikidata data is widely and automatically compared to Wikipedia content, that parts of it have very high visibility (by virtue of being directly displayed in Wikipedia), and that DBpedia - in case of an extraction error - does not allow for direct curation.
7) Publication: this is listed as a difference between DBpedia and Wikidata, but to the best of my knowledge this seems pretty equal between those two. Can you elaborate?
8) Coverage: “there is no study yet that performs a qualitative and quantitative comparison”. Take a look at http://www.semantic-web-journal.net/system/files/swj1141.pdf
9) “We argue that the result of this complementarity” - you described a few differences, but I didn’t really see how they are complementing each other at this point. I would like to see that improved to actually focus on the complementary strengths of each other.
10) “Wikidata would be better integrated into the network of Linked Open Datasets” - That’s a claim, but why would that be the case? What is currently missing for integration into LOD?
11) “...and Linked Data aware users had a coherent way …” I am not sure if you are trying to say they “would have a coherent way” or “if they had a coherent way”. Also, isn’t there a coherent way already? HTTP and related Semantic Web standards?
12) “...the right balance between coverage and quality.” The paper does not discuss quality of the two resources, and also how a user of the datasets could actually choose between coverage and quality. I would like to see this argument expanded.
Page 2
13) Figure 1’s description has a footnote 5, which does not exist.
14) “While DBpedia has a … commonly used ontology” -> replace “commonly” with “widely”, and also, where is the ontology used? (i.e. citations needed)
15) “... people face difficulties when confronted with … Wikidata schema.” -> Reference needed that this is indeed the case.
16) “As a result, with the DBpedia Wikidata (DBw) dataset can be queried with the same queries that are used with DBpedia.” rephrase
Section 2
17) After the first Wikidata and before citation [7] there is too much whitespace
18) “Wikidata is community-created knowledge base” Add “a”
19) “... and more than 2.7 million registered users…” add “has”. Also the number 2.7 million is highly inflated (it is the number of registered users over all Wikimedia projects, and includes in particular also spam accounts), and the number of active users (about 6,700) is much more interesting. Source: https://stats.wikimedia.org/wikispecial/EN/TablesWikipediaWIKIDATA.htm
20) “... and an optional reference.” -> “and one or more optional references”, i.e. there can be more than one reference.
21) “No value” marker means…” add “the”
22) “unknown value” marker means…” add “the”
23) “but exact value not known” -> “but the exact value is not known”
24) Footnote 6 seems to be wrong. The text talks about custom value, but the footnote defines a SPARQL prefix.
25) Listing 1, description: “Douglas Adams” -> “Douglas Adams’ “, add apostrophe at end
26) In the second paragraph, after DBpedia and before reference [5] there is a lot of whitespace.
27) “The DIEF is able” -> I would drop the “The”
28) “... and allows the easier integration of different Wikipedia language editions.” -> easier than what?
Section 3
Page 3
29) Section 3 says that it described design decisions. What is the design decision in the second paragraph, titled “Re-publishing minted IRIs as linked data”? What are the alternatives, i.e. what did you decide against?
30) “... most companies … keep the datasets hidden”, and you list Freebase as an example hidden dataset. What is hidden in Freebase?
31) “Datatype support in Wikidata started at the end of 2015” -> Maybe I misunderstand, but datatypes have been introduced to Wikidata in February 2013 IIRC.
Section 4
32) “...to map, in real-time, Wikidata…” - isn’t Wikidata mapped from the dumps? What do you mean by real-time?
Section 4.1
33) “... we define Wikidata property to ontology mappings.” -> “to DBpedia ontology mappings.”
Section 4.1.1
34) “... and at the same time crowd-source the DBpedia ontology.” -> Section 1 states that the DBpedia ontology is “relatively stable”. It sounds to me that crowd-sourcing and the promise of stability are highly contradictory. What am I missing?
Page 4
35) Fig. 2 has a box labeled with “Virtuose”. Should be “Virtuoso”
Section 4.1.2
36) “The value transformation … as functions.” Sentence is ungrammatical.
37) “$2 replaces the placeholder with a space the wiki-title value…” Sentence is ungrammatical
Page 5
38) “If mappings for the current Wikidata property exist…” - and if not?
Section 4.2
39) “If a DBpedia class is found, all super types are assigned…” Does this follow only the DBo or also Wikidata’s P279?
40) “After the redirects are extracted, a transitive redirect closure (excluding cycles) is calculated” -> are there any cycles? There shouldn’t be. Is this reported?
41) “The first step is performed in real-time…” As above - what does real-time mean here?
Section 4.3
42) “We append an additional hash on the IRI” - a hash of what?
Page 6
43) Table 1, row “-Other”, “Wikidata statements DBpedia ontology”, sentence incomplete
44) Table 1, “Mapping Errors” list 2.9M errors. Why so many?
45) “Aliases label and descriptions” - add one or two commas.
Section 6
46) The title of the Section is “Statistics and Evaluation”, and whereas I see plenty of statistics, I didn’t see much of an evaluation.
Page 7
47) Table 6, would be more useful if you also provided the labels for the properties
48) The count in Table 6 uses a comma as a separator instead of a dot (as in Tables 3 and 4).
49) “Wikidata does not have … date paoperties” -> properties
50) “most frefuent at the moment” -> frequent
51) “We generated [854k] redirects - including transitive.” How many of these were transitive? I would not expect many in Wikidata, that is why I am asking.
52) “According to Table 2, a total of 2.9M errors originated from the schema mappings and 42k triples did not pass” - Could you provide a bit more insight into the nature of these millions of errors? Are these problems in Wikidata, the mapping, in DBpedia?
Section 7
53) “The DBpedia publishing workflow guarantees: a) long-term availability through the DBpedia Association” - Are you expecting the DBpedia Association to be able to guarantee a more long-term availability than the Wikimedia Foundation?
54) “b) agility in following best-practices as part of the [DIEF]” - I am trying to understand this sentence, but I fail. What does it mean?
55) “In addition to the regular and stable releases of DBpedia we provide more frequent dataset updates from the project website.\footnote{http://wikidata.dbpedia.org/downloads}” - What is the frequency of the regular releases? I checked the given URL (on July 12, 2016), and the three downloads there were named 20150307, 20150330, and 20160111 - so the last update was more than half a year old, the one before was 10 months earlier. How often is “more frequent”?
Page 8
Section 8
56) “Since DBpedia provides transitive types directly, queries where e.g. someone asks for all ‘places’ in Germany can be formulated easier.” -> The experience with the Wikidata SPARQL endpoint shows us that the materialization of transitivity seems to have hardly an effect on the ‘easiness’ of query formulation, given that the endpoint supports transitivity. I would like to see some support for this claim before seeing it published.
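To illustrate this point, here is a hedged sketch (the class and property IDs are illustrative examples, not taken from the paper): with a property path, the transitive query is roughly as concise as the one over materialized types.

```sparql
# DBw-style: transitive types are materialized, so a direct rdf:type suffices
SELECT ?place WHERE {
  ?place rdf:type dbo:Place ;
         dbo:country dbr:Germany .
}

# Wikidata-style: transitivity is resolved at query time via a property path
SELECT ?place WHERE {
  ?place wdt:P31/wdt:P279* wd:Q2221906 ;  # instance of a (sub)class of geographic location
         wdt:P17 wd:Q183 .                # country: Germany
}
```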
57) “Finally, the DBpedia queries can, in most cases directly or with minor adjustments, run on all DBpedia language endpoints.” - What is the advantage of that? Sure, I understand, you can take a query from the French DBpedia endpoint and run it on the Greek endpoint, but why would you? Wouldn’t a single unified dataset with all the knowledge be more useful for most applications? For what application would this be an advantage?
58) Listing 5: In #DBw, can I not use en:Germany instead of dw:Q183?
59) Listing 5: In #wikidata, you can also use the standard-conforming FILTER (LANG(?label)=’en’) instead of the SERVICE call. But if you insist on using the proprietary SERVICE call, it would make much more sense to use an example with at least three different variables, or else the advantage of the SERVICE call is not visible.
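The standard-conforming alternative referred to here might look like the following sketch (the triple patterns are illustrative, not taken from Listing 5):

```sparql
SELECT ?item ?label WHERE {
  ?item wdt:P31 wd:Q5 ;          # e.g., instances of human
        rdfs:label ?label .
  FILTER (LANG(?label) = "en")   # plain SPARQL 1.1, no proprietary SERVICE clause
}
```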
60) Also, in the #wikidata query, you probably would like to use wdt:P31/wdt:P279* as the property - you forgot the *, or else the answers won’t be comparable (I assume that you materialize the whole transitive closure in DBw).
61) Listing 6, #DBw: are you sure the first predicate is rdf:statement, and not rdf:subject?
62) Also, wouldn’t it make sense to simplify the Wikidata query to
?person p:P26/pq:P580 ?marriage_date
Instead of the three lines for the triple pattern in DBw we would have a single line in Wikidata.
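Spelled out, the suggested path notation abbreviates the following two patterns (a sketch; pq:P580 is Wikidata’s “start time” qualifier):

```sparql
SELECT ?person ?marriage_date WHERE {
  ?person p:P26 ?marriage .           # statement node for the 'spouse' property
  ?marriage pq:P580 ?marriage_date .  # 'start time' qualifier of that statement
}
```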
63) “Converting a dataset to a more used and well-known schema, it makes it easier to integrate the data.” remove “, it”. Also, for the claim that DBpedia is still more used and better known than Wikidata, I would like to see some supporting material for that.
64) “The fact that datasets are split according to the information they contain makes data consumption easier when someone needs a specific subset” - I didn’t see this split mentioned anywhere in the paper. Care to elaborate?
65) “... and fill in semi-structured data that are being moved to Wikidata.” Only for the data that is moved to Wikidata? You are not planning to use the data that is originally entered into Wikidata, and has not been moved from Wikipedia? How do you keep track of whether some data has been moved or has been originally entered into Wikidata?
66) “It is also plan of short-term plan ...” - add “a”
67) “...to fuse all DBpedia data into a single namespace…” - What does this mean? Given that the dbo-namespace uses the labels of the properties, like dbo:country, if you merge them in a single namespace, does that not lead to namespace clashes?
Section 9
68) “...the daily number DBw visitors...” - add “of”
69) “Which indicates that this dataset is heavily used” - how do you figure that? I mean, which numbers constitute heavy use, medium use, light use, and how did you decide that?
Page 9
70) Reference 3: unicode problems in the authors names, also rdf is not capitalized.