Abstract:
Information related to the COVID-19 pandemic ranges from biological to bibliographic, from geographical to genetic and beyond. The structure of the raw data is highly complex, so converting it to meaningful insight requires data curation, integration, extraction and visualization, the global crowdsourcing of which provides both additional challenges and opportunities. Wikidata is an interdisciplinary, multilingual, open collaborative knowledge base of more than 90 million entities connected by well over a billion relationships. A web-scale platform for broader computer-supported cooperative work and linked open data, it can be queried in multiple ways in near real time by specialists, automated tools and the public, including via SPARQL, a semantic query language used to retrieve and process information from data stored in Resource Description Framework (RDF) format. Here, we introduce four aspects of Wikidata that enable it to serve as a knowledge base for general information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The rich knowledge graph created for COVID-19 in Wikidata can be visualized, explored and analyzed, for purposes like decision support as well as educational and scholarly research.
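As a minimal illustration of the kind of SPARQL access described above, the following sketch retrieves a few scholarly articles about COVID-19 from the Wikidata Query Service (https://query.wikidata.org/sparql). The identifiers used here (Q84263196 for COVID-19, P921 for "main subject", P577 for "publication date") are illustrative assumptions and should be verified against the live vocabulary:

```sparql
# Minimal sketch: ten scholarly articles whose main subject is COVID-19,
# with their publication dates. The prefixes wd:, wdt:, wikibase: and bd:
# are predeclared by the Wikidata Query Service.
SELECT ?article ?articleLabel ?date WHERE {
  ?article wdt:P921 wd:Q84263196 ;   # main subject: COVID-19 (assumed QID)
           wdt:P577 ?date .          # publication date
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?date)
LIMIT 10
```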
Comments
Major questions and some minor comments.
Dear authors,
I had a brief look at some parts of the paper, and quite a few questions came up while reading.
# Recency
On page 22 you cite https://www.wikidata.org/wiki/Special:Statistics as a source for the number of active users in Wikidata, but there an active user is defined as anyone performing "any action". https://stats.wikimedia.org/#/wikidata.org lists active editors at 13k per month, which is about half that figure. I mention this because recency or up-to-dateness has, to the best of my knowledge, been verified for Wikipedia (e.g. in comparison to Encyclopedia Britannica), but not for Wikidata (or has it? do you have a reference?). https://stats.wikimedia.org/#/en.wikipedia.org lists 46k active editors for EN Wikipedia. So EN Wikipedia has about 3.5 times as many editors, but roughly 12 times fewer pages/items to maintain (7 million instead of 90 million). You also mention that Wikipedia is more up-to-date in certain areas. This part should be looked at more carefully. Wikidata was not adopted to replace Wikipedia's infoboxes because it is not as up-to-date as Wikipedia; see the discussion at https://en.wikipedia.org/wiki/Wikipedia:Wikidata/2018_Infobox_RfC (infoboxes are still growing a lot in Wikipedia).
You also mention DBpedia as being made by machines. The main advantage of DBpedia is that it does not need the continuous effort of 13k human users: this work has already been done by 46k Wikipedians, and the quality of Wikipedia edits is very high and timely. So in terms of recency, the DBpedia approach of extracting from the up-to-date Wikipedia should often be more current.
Compare:
* EN Wikipedia: https://en.wikipedia.org/w/index.php?title=COVID-19_pandemic_in_New_York...
* Wikidata: https://www.wikidata.org/w/index.php?title=Q88641716&oldid=1405487200 copied on April 1st, 2020 from EN Wikipedia. Note that the numbers were copied erroneously, i.e. they are one order of magnitude too small!
* Ad hoc DBpedia extraction: http://dief.tools.dbpedia.org/server/extraction/en/extract?title=COVID-1...
DBpedia is not perfect, of course, but it does quite a good job of reflecting the up-to-date Wikipedia. Admittedly, this was only one example, but it was the first one I looked at. Maybe I was lucky.
I am sceptical whether the assumption holds that Wikidata handles COVID-19 data well, since such data is highly volatile and evolving. Theoretically, it can or could handle such data well, given that enough editors spend the effort. Often, however, human curation reaches only about 80% coverage, because beyond that point it gets much harder to contribute (the 80/20 rule, or law of diminishing returns). Quite a few fields in Wikidata were filled using an infobox extractor of unclear quality (https://pltools.toolforge.org/harvesttemplates/; see "imported from" in the statement metadata of the example above). I am unsure whether this extractor has an update function, and I assume that most of the time it is run only once and never re-run (see the example above). Did you verify the data used with regard to its recency or correctness?
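As a concrete way to check this, a provenance query along the following lines could be run against the Wikidata Query Service; the property IDs are quoted from memory (P143 for "imported from Wikimedia project", P813 for "retrieved") and should be verified:

```sparql
# Sketch: which statements on the example item above carry an
# "imported from Wikimedia project" reference, and do they record a
# retrieval date? Stale imports show up as old or missing dates.
SELECT ?property ?propertyLabel ?source ?retrieved WHERE {
  wd:Q88641716 ?claim ?statement .
  ?property wikibase:claim ?claim .           # restrict to statement predicates
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P143 ?source .                      # imported from Wikimedia project
  OPTIONAL { ?ref pr:P813 ?retrieved . }      # retrieved (date), if recorded
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```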
My question would be whether the paper has produced anything conclusive regarding recency, i.e. whether one can really use Wikidata's data to draw solid or reliable conclusions. In my opinion, it could be troublesome to use partial or incorrect information to visualize something and thus make it look more "truthy". Couldn't the many linked data sources (Tables 5-8) or DBpedia Live / ad hoc extraction be used for verification and comparison? If data is copied into Wikidata, there is always a risk that it becomes stale after a while.
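For the case counts in particular, a staleness check seems straightforward, assuming the usual modelling with "number of cases" (P1603) statements qualified by "point in time" (P585); comparing the latest qualifier date with the current Wikipedia or DBpedia value would show how far behind the item is:

```sparql
# Sketch: latest recorded case counts on the example item, with their
# "point in time" qualifiers; the most recent date indicates freshness.
SELECT ?cases ?pointInTime WHERE {
  wd:Q88641716 p:P1603 ?stmt .       # number of cases (assumed property)
  ?stmt ps:P1603 ?cases ;
        pq:P585 ?pointInTime .       # point in time qualifier
}
ORDER BY DESC(?pointInTime)
LIMIT 5
```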
# minor
* Tables 5-8 are missing a "k" for thousands in the counts.
* "to drive Wikidata instead of other systems that represent entities using textual expressions, particularly Virtuoso [19]." Virtuoso is a graph database; you can host any RDF dataset with it. It is open source and quite scalable as well; see Wikidata loaded into Virtuoso: https://wikidata.demo.openlinksw.com/sparql
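To make the point concrete: a query written with explicit prefixes runs unchanged against either the Wikidata Query Service (https://query.wikidata.org/sparql) or the Virtuoso instance above, since the endpoint is chosen by the client, not by the query (the only caveat being that a generic Virtuoso endpoint may not predeclare the Wikidata prefixes):

```sparql
# Endpoint-agnostic sketch: count the triples about the COVID-19 item.
# Runs on WDQS and on the Virtuoso-hosted copy of Wikidata alike.
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT (COUNT(*) AS ?triples) WHERE {
  wd:Q84263196 ?p ?o .
}
```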
# 5. Conclusions
Reading the conclusion section, I do not get a clear idea of what the contribution of this paper is. It mentions that Wikidata is user-friendly to query and that visualisations can be created. What are the "deeper insights" mentioned in the conclusions section?
For DBpedia, the research methodology was quite clear: everybody found shortcomings and criticized DBpedia heavily, and ways were then devised to improve it, which made DBpedia better in the end. In this paper, I saw some paragraphs of self-criticism and some shortcomings mentioned in the text, but no suggestions on how these could be improved upon or mitigated, except for the argument that "the community will take care of these". I am sure that Wikidata could be quite good at integrating data (naturally, if you load data from different Linked Data sources into one database, you can do very good analytics). I am just wondering how well this worked in practice: what worked well, what did not, and what needs to be improved. The conclusion section seems very unspecific on this, but I am quite curious to know.
# 2.3 Database alignment and 5 Conclusions
The conclusion claims that "Wikidata has become a hub for COVID-19 knowledge." This is probably based on Section 2.3 (Database alignment), which also mentions 5302 Wikidata properties used for alignment. DBpedia also extracts these from Wikidata, trying hard to judge their semantics on extraction. They range from HTML or wiki links to rdfs:seeAlso to skos: to owl:sameAs, which is quite mixed (as also written in the cover letter). Sometimes Wikipedia articles (and therefore the related DBpedia and Wikidata IDs) are quite general and need two or more links with "skos:narrower" semantics when linking to the same source. Did you gain any insight into which semantics apply to which of the 5302 Wikidata properties, or into how to distinguish or use these links? Any insight here would be highly appreciated.
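For instance, the identifier-level part of these mappings can at least be enumerated mechanically; the sketch below lists the external-identifier properties used on the COVID-19 item (Q84263196 is an assumed QID) together with their formatter URLs (P1630), but it says nothing about whether a given link deserves owl:sameAs or skos:narrower semantics, which is exactly the open question:

```sparql
# Sketch: external-identifier properties and values on the COVID-19 item,
# with formatter URLs where available. Filtering is done by property
# datatype via wikibase:propertyType wikibase:ExternalId.
SELECT ?property ?propertyLabel ?value ?formatterURL WHERE {
  wd:Q84263196 ?directProp ?value .
  ?property wikibase:directClaim ?directProp ;
            wikibase:propertyType wikibase:ExternalId .
  OPTIONAL { ?property wdt:P1630 ?formatterURL . }   # formatter URL
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```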
As written before, I only took a brief look at the paper, starting with the introduction and the conclusion section, and then selectively at some further parts. Sorry if I am asking questions that have already been answered somewhere else in the paper.
Comments addressed
Dear Sir,
We thank you for your comments regarding our research paper. We believe that these comments will improve the final version of this publication. Please find the answers to all your comments appended to the last pages of the updated version of this work. We would be honoured to receive any further comments about our paper.