A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO

Tracking #: 1141-2353

Michael Färber
Basil Ell
Carsten Menne
Achim Rettinger

Responsible editor: 
Jie Tang

Submission type: 
Survey Article
In recent years, several noteworthy large, cross-domain, and openly available knowledge graphs (KGs) have been created. These include DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Although extensively in use, these KGs have not been subject to an in-depth comparison so far. In this survey, we first define aspects according to which KGs can be analyzed. Next, we analyze and compare the above-mentioned KGs along those aspects and finally propose a method for finding the most suitable KG for a given setting.
Major Revision

Solicited Reviews:
Review #1
By Zhigang Wang submitted on 20/Sep/2015
Major Revision
Review Comment:

The paper aims to systematically analyze and compare five general knowledge graphs, and to help find the most suitable KG for a given setting. As KGs become more and more popular, such a survey is necessary and will help beginners grasp the field quickly. The readability is good. As a survey article, the authors carried out extensive comparisons along their defined criteria (aspects).
However, these criteria seem superficial rather than in-depth. Most of the criteria are descriptive but not essential, such as “Homepage”, “License”, “Data Formats”, “RDF export”, “LOD registration”, and so on. I wonder whether beginners could find the most suitable KG for their application after reading the paper. Some detailed comments follow:
1. One of the most important criteria should be knowledge quality. How can knowledge quality be assessed across these KGs? Two possible metrics are correctness and coverage. It would be better to construct a gold-standard dataset to evaluate knowledge quality. Such a dataset could contain the top-K queried entities generated from a search engine query log. Using this dataset, a formal evaluation of correctness and coverage would help to differentiate the KGs in an essential way. Evaluations of both frequent facts and facts in the long tail would be of great value.
2. The criteria defined by the authors are not in-depth. For example, most of the entities of DBpedia and YAGO come from Wikipedia. But how should one choose the most suitable KG for a given application? The essential differences between DBpedia and YAGO lie in the schema and properties (relations). YAGO also contains the classes defined in WordNet, but has only about 100 well-defined properties. DBpedia, on the other hand, contains almost all of the properties generated from Wikipedia infoboxes. DBpedia tends to be the better choice for a property-oriented application, but YAGO for a schema-oriented one. This is not revealed in the submission. Detailed statistics on classes, relations, and instances are necessary.
3. In Section 2, the authors use nearly two pages to introduce the definitions of the semantic data model and the graph model. However, these two definitions seem to contribute little to the comparison of the KGs. In particular, the part on the “Genesis of Semantic Data Models” is redundant. I think a formal definition of Knowledge Graph would be a better choice for this section.
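The gold-standard evaluation proposed in point 1 could, for instance, be computed as follows. This is only an illustrative sketch: the fact triples, human judgments, and entity sets below are invented, and real evaluations would sample facts from the KG and obtain judgments from annotators.

```python
def correctness(sampled_facts, judgments):
    """Precision-style correctness: fraction of sampled KG facts
    that a human annotator judged to be correct."""
    correct = sum(1 for f in sampled_facts if judgments.get(f, False))
    return correct / len(sampled_facts)

def coverage(gold_entities, kg_entities):
    """Recall-style coverage: fraction of gold-standard entities
    (e.g., top-K queried entities) that appear in the KG."""
    return len(gold_entities & kg_entities) / len(gold_entities)

# Toy example with invented data:
facts = [("Berlin", "capitalOf", "Germany"),
         ("Berlin", "population", "1")]
judged = {facts[0]: True, facts[1]: False}  # annotator verdicts

gold = {"Berlin", "Germany", "Paris", "Tokyo"}  # top-queried entities
kg = {"Berlin", "Germany", "Paris"}             # entities found in the KG

print(correctness(facts, judged))  # 0.5
print(coverage(gold, kg))          # 0.75
```

Reporting both numbers separately for frequent and long-tail entity samples would realize the comparison the reviewer asks for.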

Review #2
Anonymous submitted on 20/Sep/2015
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

This submission surveys the development of several well-known open knowledge graphs, including DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. The authors aim to conduct a comprehensive comparison on these knowledge graphs, and provide suggestions for KG choice under different settings. However, my major concern lies in the lack of new knowledge and insight revealed over the course of comparisons.

First, most of the comparison items are too simple and straightforward, such as homepage, version, and so on in Tables 1 to 7. This information is available online on the respective wikis or homepages. At the same time, some basic information is missing, such as the number of entities, classes, and relations in the different KGs.

Second, instead of spending most of the space listing these items, the advantages and disadvantages of the covered knowledge graphs with respect to different problems and tasks should be compared and discussed.

Third, the assessment of KGs in Section 6 needs further detailed work and should deliver concrete suggestions rather than merely listing the differences from Tables 1 to 7.

Strong points of the submission
The writing of the submission is very easy to follow.

Detailed comments:
Page 1 the last paragraph
It would be great if the authors could state the reasons for the criteria, e.g., why RDF and SPARQL support are required.

Page 4 section 3
The arguments that the chosen criteria are better than those used in previous work are not convincing.

Page 5 – 19
The comparison part is very redundant and lacks new knowledge and insight about the knowledge graphs.

Page 19
As noted above, the assessment of KGs in Section 6 needs further detailed work and should deliver concrete suggestions rather than merely listing the differences from Tables 1 to 7.

In a nutshell, the submission tries to summarize and compare the current development of several well-known open knowledge graphs. The topic is definitely of great interest to the Semantic Web community. However, the submission needs more in-depth analysis and comparison, and should present concrete suggestions and conclusions on KG assessment.

Review #3
By Sebastian Mellor submitted on 21/Sep/2015
Minor Revision
Review Comment:

The authors of this survey clearly outline the common and additional features, along with limitations, of five of the larger, cross-domain, publicly available knowledge graphs. The authors ultimately suggest that a score can be determined for each considered KG based on the requirements of a particular purpose. Throughout the survey, requirements are detailed such that a reader should be able to make informed decisions about the importance of each.

Having sufficient technical experience in most areas discussed in the survey, while also being a relative novice with regard to these particular datasets, I believe that I am a suitable target audience for this survey; my review is written as such.

Firstly, with regard to the Freebase KG, would the authors be able to provide updated information regarding the current state of the API and data migrations?

Regarding the quality and trustworthiness of the data, I feel there could be more explanation of the provenance of facts and of quality assurance. Re: provenance in particular, could the authors briefly expand upon how one might use the provenance to determine trustworthiness? Could Table 3 outline the fields that are available for each KG, e.g., userid, source reference? Re: quality assurance, is it possible to describe each KG in comparable terms rather than no, trusted, depends, 95%? This may not be feasible, but I would appreciate a response from the authors, as some guidelines for evaluating quality/trust would also aid in assessing further KG resources.

A key limitation of the available KGs appears to be the varying domain specificity and the lack of descriptions. Table 8 (decision matrix) does not highlight the covered domains (Table 1) for each KG; from the authors' descriptions, users of the OpenCyc KG would have different requirements from the other four and thus little choice. Is this correct? When referencing Table 8 alone, excluding that row could be misleading.
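The decision-matrix scoring the survey proposes could be sketched as a simple weighted sum over criteria. The criterion names, weights, and per-KG ratings below are invented for illustration and are not taken from Table 8.

```python
def score(ratings, weights):
    """Weighted sum of per-criterion ratings (each rating in [0, 1]),
    where the weights encode the user's requirements."""
    return sum(weights[c] * r for c, r in ratings.items() if c in weights)

# Hypothetical requirement weights and KG ratings:
weights = {"sparql_endpoint": 3, "covered_domains": 2, "open_license": 1}
kgs = {
    "DBpedia": {"sparql_endpoint": 1.0, "covered_domains": 1.0, "open_license": 1.0},
    "OpenCyc": {"sparql_endpoint": 0.0, "covered_domains": 0.5, "open_license": 1.0},
}

best = max(kgs, key=lambda name: score(kgs[name], weights))
print(best)  # DBpedia (score 6.0 vs. 2.0)
```

Note how a row that is excluded from the matrix (e.g., covered domains) simply drops out of every score, which is exactly why its omission could mislead users with domain-specific requirements.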

Other questions raised earlier in the survey were subsequently answered, such as "What would one be required to do to interface all KGs with SPARQL?". Overall, this survey is well written, readable, and clear. I would only additionally ask that the authors perform some additional proofreading to address minor spelling or grammatical mistakes such as:

p12, col1, middle "(i) is only duable" -> "doable"
p23, 8. Outlook, "limited extend so far." -> "extent"

The survey also approaches issues related to linking open data, such as the requirement to align entities and schemas. It would be good to see future work along the lines of rating the suitability of certain datasets for being successfully (or partially) linked. Clearly much work has already been done in unifying many sources, but the varying specificity of knowledge covered by particular sources highlights the concern that a single KG may not be sufficient for a purpose, although this may be out of scope for the current survey article.

I see this survey paper as being valuable as a reference when assessing new or additional knowledge graphs and building a more complete overview, and equally valuable as a thorough introduction to available KG data sets and how one might be able to use them. For these reasons I feel the article should be accepted.