A Survey on Automatically Constructed Universal Knowledge Bases

Tracking #: 1726-2938

Bayzid Ashik Hossain
Rolf Schwitter

Responsible editor: Jens Lehmann

Submission type: Survey Article

Abstract:
A universal knowledge base can be defined as a domain-independent ontology containing instances. Ontologies define concepts and the relationships among these concepts and are used to describe and represent a domain of interest. Such knowledge bases are the elementary units for inference techniques on the Semantic Web, an extension of the World Wide Web that enables software agents to share content beyond the limitations of individual applications and websites. This survey focuses on the most prominent automatically constructed universal knowledge bases, including KnowItAll, YAGO, NELL, Probase, BabelNet and Knowledge Vault. We take a closer look at how these knowledge bases are built, in particular at the information extraction and taxonomy generation processes, and investigate how these knowledge bases are used in practical applications. Due to quality concerns, the most successful and widely employed knowledge bases are assembled manually to maintain high quality, but they suffer from low coverage and high assembly and quality assurance costs. In contrast, automatic approaches to building knowledge bases try to overcome these drawbacks. Although it is difficult to achieve the same level of quality as for manual knowledge bases, we found that the surveyed automatically constructed knowledge bases show promising results and are useful for many real-world applications.

Decision: Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 19/Oct/2017
Minor Revision
Review Comment:

In the survey paper with the title "A Survey on Automatically Constructed Universal Knowledge Bases" the authors describe information extraction and knowledge acquisition techniques applied during the automatic creation of six knowledge bases (KnowItAll, YAGO, NELL, Probase, BabelNet and Knowledge Vault). The paper is well structured and follows a fixed subsection scheme for good readability. For each knowledge base system, the information extraction and taxonomy creation are presented, followed by a description of the characteristics of the system. Finally, the systems are compared with respect to criteria like the number of facts/relations, accuracy, multilinguality etc. The language is clear and on a professional level. The authors provide six pseudocode algorithms for taxonomy generation as well as listings of extraction patterns. Furthermore, several examples embedded in the text support comprehension of the system descriptions.
However, while comprehensive literature is referenced for several techniques, a few high-level "paraphrases" make it difficult to understand parts of the system descriptions. I think that for readers without detailed knowledge of information extraction and natural language processing it is hard to get an impression of what a system does from phrases like: "It [..] correlates a disambiguation context", "building a context by using a context matrix" and "algorithm that employs resource specific properties". Moreover, the "discussion and review" aspects of the paper could be extended. What does it imply if, e.g., a Brill tagger is used? Are there any (dis)advantages? Does one system perform better for special types of sources or relations due to one component or technique? Are there limitations by design?
Nevertheless, the paper's concept seems suitable for a survey of the application of fact-, relation- and knowledge-extraction techniques and other NLP techniques in the scope of universal knowledge base creation.
Considering that there are some issues in the text (see comments below), I suggest a (minor) revision of the paper.

- the (figure) captions could be more expressive (e.g. for Figure 1: in which system is it used, and what is the purpose of the pattern?)
- Is there really a benefit of showing Algorithm 1? It is already described in detail in the text.
- Wrong space in Footnote 5
- Algorithm 2 and 3 and Figure 3 are not referenced in the text.
- possible typo: "countries, and sports" (the "and" is also italic)
- this ellipsis is hard to read: p. 6 “Such as OpenEval ...”
- 4.2 what is a context matrix? what is used as context?
- typo: 4.3 - p. 6 “and uses coupled”-> I think the s is not correct
- p. 7: “more than one [candidate] exist[s]” ??
- p. 8: I think "Proctor and Gamble" would become two separate candidates according to the Hearst pattern in Fig. 3 (see the sketch after this list); also, the US company is spelled "Procter & Gamble"
- p.8: I do not understand how there can be two candidates for the candidate y_j if it denotes the word at offset position j. There should be only one word at that position? Or do c_1 and c_2 refer to semantic concepts?
- 5.2 What is a horizontal and vertical merge?
- 3rd paragraph of 5.2: the Double turnstile (\models) is not (formally) introduced for the pairs. I suggest to paraphrase it instead.
- Algorithm 5 and 6 are not referenced in the text.
- p.9: 5.2. I assume that by grouping and merging you mean the same thing, but it would be clearer if you stuck with "merging"
- p.9: 6 typo: “One major features”
- p.10: hard to understand without NLP background “It [..] correlates a disambiguation context”, “algorithm that employs resource specific properties”.
- p. 11 “each Wikipage d which is redirected from” --> Do you mean “redirected to”? Otherwise I do not understand the sentence.
- 6.3. “BabelNet uses the disambiguation ..” --> This sentence is quite long.
- 7. "significantly bigger" --> please provide numbers; according to Table 1, BabelNet seems larger?
- 7.1 "KV then finds example sentences" -> how?
- 7.1 What are lexicalized DOM paths? I think an example would be helpful here, since this seems to be a unique feature of KV.
- Table 3: Are these accuracy values comparable?

- Please review the references; some exemplary errors are listed below (incomplete):
[2] year and conference is missing
[8] journal name is missing
[15] typo “acollection”
[17,24,40] journal/conference name is missing
[31] year and name is missing
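
To make the "Procter and Gamble" point above concrete, here is a minimal, hypothetical sketch in Python of a Hearst-pattern extractor of the form "X such as Y1, Y2, ... and Yn". The pattern template and the naive comma/"and" splitting are illustrative assumptions, not Probase's actual implementation; the naive splitting is exactly what turns a single conjoined name into two candidates:

import re

# Toy Hearst-pattern extractor: "X such as Y1, Y2, ... and Yn".
# Illustrative only; Probase's real extraction framework is more elaborate.
HEARST = re.compile(r"(?P<x>\w[\w ]*?) such as (?P<ys>[\w&, ]+)")

def extract_isa_pairs(sentence):
    """Return (superconcept, candidate) pairs, i.e. 'candidate isA superconcept'."""
    match = HEARST.search(sentence)
    if not match:
        return []
    superconcept = match.group("x").strip()
    # Naive candidate splitting on ",", "and" and "or" -- this is the step
    # that wrongly splits a conjunction inside a single name.
    candidates = re.split(r",|\band\b|\bor\b", match.group("ys"))
    return [(superconcept, c.strip()) for c in candidates if c.strip()]

print(extract_isa_pairs("companies such as Microsoft, Google and Procter and Gamble"))
# [('companies', 'Microsoft'), ('companies', 'Google'),
#  ('companies', 'Procter'), ('companies', 'Gamble')]   <- the ambiguity in question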

Review #2
Anonymous submitted on 03/Nov/2017
Major Revision
Review Comment:

The paper presents a survey on platforms for automatically constructing universal knowledge bases such as KnowItAll, YAGO, NELL, Probase, BabelNet, and Knowledge Vault. The main task performed by these platforms consists in automatically extracting and representing the knowledge coming from unstructured data sources on the Web. This paper can be considered as an introductory text for researchers who need to use these platforms for their purposes.

Even though the paper analyzes the most famous platforms in this area, it falls short of providing an in-depth analysis of the works that used these techniques and of highlighting their pros and cons. For example, it could be useful to know that BabelNet has been effectively used for cross-language content-based recommendations. A table where the platforms are compared according to their application to different tasks (with pros and cons) could be really useful for the community.

The presentation is fair, and the paper is easy to read. However, for each platform, the descriptions of the components should refer directly to the algorithms reported in the paper. Instead, the descriptions generally do not refer to the corresponding algorithms.

The topic is relevant for the Semantic Web community, but a more critical review should be provided of the platforms analyzed.

Review #3
By Afshin Sadeghi submitted on 06/Nov/2017
Major Revision
Review Comment:

Article Title: A Survey on Automatically Constructed Universal Knowledge Bases
Submitted to: Semantic Web Journal

The submitted article presents a review of six projects that generate domain-independent knowledge bases using semi-automated techniques. The authors describe the techniques used in the KnowItAll, YAGO, NELL, Probase, BabelNet and Knowledge Vault projects. Unfortunately, the survey is missing DBpedia, Freebase and Wikidata, which are prominent KBs in the Semantic Web. The authors present the review of each KB generation approach in three sections: "Information Extraction", "Taxonomy Generation" and "Characteristics". In the discussion section, the authors compare these systems based on the number of entities they contain, as well as on methodology and accuracy. In the conclusion section, the authors state that the paper has indicated the features of each KB so that the reader can choose a KB based on her needs, under the incorrect assumption that all these KBs are openly available. In this section, the authors also specify the main feature of each KB and dataset generation method: YAGO supports temporal and spatial dimensions, BabelNet has a large number of multilingual synsets, Probase provides a probability of correctness for each fact, and NELL can be seen as an autonomous agent that continuously learns new facts.

General comments:

The main reason for my suggestion of a major revision:
Terms and concepts are mixed throughout the article. In particular, the term "knowledge base", which is a central concept of this article, is used interchangeably for a knowledge base, for a system that administers one, and for a KB construction approach. The authors also include an information extraction system as a KB. For example, "KnowItAll" is categorized as a knowledge base and listed among the surveyed KBs, and is then mentioned in Section 2 as a KB system; however, it is an information extraction system, which is neither a knowledge base nor a KB system. In the 7th line of Section 2, the term "KB" is used for a system that creates KBs.

Another such term is "information fusion": the information fusion step of Knowledge Vault is explained in Section 7.2, which is a section dedicated to information extraction.

To establish a better understanding of the terms and to enable a better comparison of the papers, I suggest that the authors first provide definitions of the most important terms, as well as of the comparison characteristics used in Section 8, and then revise their survey based on these definitions. They could also insert these definitions in the introductory sections of the article, so that the reader shares a common understanding of the terms with the authors.

Minor issue 1:
BabelNet and Probase are lexical knowledge resources, and I doubt that comparing them with ontology-based knowledge bases makes for a good comparison. Although they include many concepts, they provide only one or two relation types.

Minor issue 2:
Another issue is that the survey reproduces parts of the original articles without quoting them. Firstly, a survey is expected to provide a critical assessment of the work that has been done and to give the point of view of its authors, not that of the original papers. Secondly, using exact phrases requires quoting and citing the original text. An example, from page 12, line 3:

"KV is different from the previous automatically constructed knowledge bases as it combines noisy extraction from the web with the prior knowledge derived from the existing knowledge base"

I see the exact phrase in paragraph 3 on page 2 of [1].

Another example: the first paragraph of Section 5.3, which talks generally about Probase, repeats the exact wording and reasoning of the first paragraph of Section 4 ("A Probabilistic Taxonomy") in [2].

Another example: line 13 of the right column of page 2, "KNOWITALL uses a novel form of bootstrapping…", is the same phrase as the second line of the second paragraph on page 2 of the original paper [3].
I should add that this bootstrapping method may have been innovative in 2004, when the original paper was published, but it is not innovative in 2017.

This manuscript is submitted as a survey article; therefore, I reviewed it along the usual dimensions for a survey paper:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
(2) How comprehensive and how balanced is the presentation and coverage.
(3) Readability and clarity of the presentation.
(4) Importance of the covered material to the broader Semantic Web community.

1. Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

This paper reviews an interesting topic. A study of universal KB construction approaches helps students and researchers to reproduce and enhance current methods of large-scale information extraction and knowledge base construction across multi-disciplinary fields. They can further use these KBs to verify their own knowledge extraction and analysis.

2. How comprehensive and how balanced is the presentation and coverage.

The paper does not strike a good balance between coverage and presentation. Important KBs such as DBpedia, Freebase and Wikidata, which are well known in the Semantic Web area, are not covered. The article is missing a section that lists and explains the classification criteria and describes the parameters used to compare these KBs. Sections 5.2 and 6.2 give too much detail about taxonomy generation.

3. Readability and clarity of the presentation.

Generally, the article is written in clear language. However, it has the following issues, which I address chapter by chapter.

Page 3:
Chapter 3, line 5:
"Rather than using any particular information extraction method, YAGO takes advantage of the category information from Wikipedia." This is indeed an information extraction approach, while the authors imply that YAGO does not use an information extraction method.
Chapter 3, second paragraph: just mentioning 97% accuracy is not precise enough. Linking Wikipedia in which language? Is it an average number? Based on which accuracy measure? I guess you mean testing the correctness of facts using the "weighted average Wilson center" mentioned in [4].
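
For reference (a standard statistics formula, not taken from the paper under review or from [4]): if \hat{p} is the observed accuracy over a sample of n manually evaluated facts and z is the standard normal quantile (z \approx 1.96 for 95% confidence), the Wilson score interval is

\frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \;\pm\; \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}

where the first term is presumably the "center" the quoted phrase refers to.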

Page 4:
Second paragraph: "YAGO2 extends the fact extraction approach of YAGO". YAGO2 is a KB but is referred to here as an extraction approach, which is not correct.
Paragraph 3: "The main objective of YAGO3 [41] is to construct a..." YAGO3 is a KB, not a pipeline.

Chapter 3.1 first paragraph
“...so that new facts from different sources can be added to the knowledge base. For this purpose each fact is tagged with a confidence value between 0 and 1.”
It is not described how assigning confidence values helps with adding facts from different sources. This part needs more clarification from the authors.

Chapter 3.1 last paragraph:
“The PCA confidence tells that if a fact that is not available in English Wikipedia then it is not necessarily wrong but merely unknown.”
As far as I know from [5], the PCA confidence is only a formula that provides a threshold for pruning generated rules. So I guess the authors wanted to explain the PCA (partial completeness assumption), not the PCA confidence.
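For reference, the PCA confidence as defined in [5] (reproduced here from the AMIE paper, not from the survey under review) divides the support of a rule \vec{B} \Rightarrow r(x,y) only by the facts for which some object y' is known:

\mathit{conf}_{pca}(\vec{B} \Rightarrow r(x,y)) = \frac{\#(x,y) :\ \exists z_1, \ldots, z_m :\ \vec{B} \wedge r(x,y)}{\#(x,y) :\ \exists z_1, \ldots, z_m, y' :\ \vec{B} \wedge r(x,y')}

Under the partial completeness assumption, a predicted fact r(x,y) for a subject x with no known r-facts counts as unknown rather than wrong, which seems to be the intuition the surveyed paper paraphrases.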

Page 5:
Line 3: I am not sure what "the head compound" refers to.
Chapter 3.3
"YAGO uses WordNet to deal with unreliable data." I am not sure how WordNet helps with dealing with unreliable data. This could be rephrased to be clearer.

Page 7:
First paragraph: “NELL’s applications includes life long learning and building autonomous agents”
I doubt that "life long learning" is an application; it is rather a characteristic of the system.

Section 5:
“Probase [25] is a universal, general purpose, probabilistic knowledge base that implements a taxonomy. ”
Probase is not a system, it is a KB.

"Thirdly, it is the largest general-purpose knowledge base that is constructed automatically from HTML text of web pages."
The number of facts in Probase is much smaller than the number of facts in NELL and BabelNet.

Page 8:
“In such a case, Probase finds the…”
Probase is a KB, not a system.

Page 9:
Last line in the right column: “Semantic drift or semantic change can be defined as…”
Semantic drift is already used in the first paragraph of page 7, but its definition is only given here. I suggest providing the definition at the first occurrence of the term.

Section 6: The whole first paragraph of this chapter is one sentence. It is hard to understand. Please break it down into smaller sentences.

5 lines to the end of this page: "These resources often have poorer coverage for non-English language compared to others which in turn make people bias to conduct research in resource-rich languages; for example, English"
The meaning of this phrase is not clear. Please rephrase.

Page 12:
“KV is much bigger than other comparable knowledge bases...”
This is exactly the same sentence as in the original paper [1], and it has no quotation marks.

Section 7.1: "knowledge base construction problem can viewed as..."
Correction: "can be viewed as".

Page 13:

Section 7.3

“KV is used in artificial intelligence applications, machine-to-machine communication, augmented reality, predictive models, and virtual assistant use cases”
As footnote 16 says, KV was actually a research project that did not lead to a product. These are use cases for the Google Knowledge Graph, not for KV, although the webpage in footnote 16 is not the best reference to rely on.

Footnotes 16 to 20 should be on page 14. The links to the footnotes in Table 1 point to the first page, not to the footnotes.

Page 14:

In Table 1
380M is the number of relations of BabelNet, not the number of relation types.
KnowItAll is not a KB, and 50,000 is not the number of facts in it; this number of facts was extracted by the system in one experiment [3].
It is not clear whether the column "Methodology" refers to the extraction methodology, the mapping methodology or the linking methodology.

In Table 3
The KBs are not comparable in terms of accuracy, since accuracy is defined differently for each of them. For example, in the original YAGO paper the authors tested the accuracy of mapping concepts to Wikipedia, whereas in Knowledge Vault accuracy measures how many triples are correctly classified. The sampling and the accuracy formulas for the different KB evaluations are not necessarily the same, and they may use different gold standards.

Section 9
No conclusions are drawn in the conclusion section; only a summary of the article is given. I suggest that the authors add their own general conclusions and their point of view to this section.

4. Importance of the covered material to the broader Semantic Web community.
The survey specifically covers the approaches used for generating different domain-independent KBs and some of their characteristics, which is a relevant and important topic for the Semantic Web community.

[1] Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[2] Wu, Wentao, et al. "Probase: A probabilistic taxonomy for text understanding." Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.

[3] Etzioni, Oren, et al. "Web-scale information extraction in KnowItAll: (preliminary results)." Proceedings of the 13th International Conference on World Wide Web. ACM, 2004.

[5] Galárraga, Luis Antonio, et al. "AMIE: Association rule mining under incomplete evidence in ontological knowledge bases." Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 413-422.

Review #4
Anonymous submitted on 09/Nov/2017
Major Revision
Review Comment:

The article "A Survey on Automatically Constructed Universal Knowledge Bases" presents the KBs (knowledge bases) KnowItAll, YAGO, NELL, Probase, BabelNet, and Knowledge Vault. It describes the extraction process for each of those from a high level perspective. It finishes with a comparison of their coverage and quality.

The article is generally well written and understandable, except for some paragraphs (see minor comments below). It should also mention the shortcomings of the presented knowledge bases. It mentions that most of these KBs are canonicalized, but does not clarify which ones. It should also analyze whether they achieve this goal. A non-canonicalized KB might contain too much noise to be useful in applications. I would also like to have more information about how the accuracy numbers in Table 3 were obtained.

The survey contains some small errors. For example, on page 4 the article states that YAGO uses the PCA confidence to align relations from multilingual Wikipedia pages. Actually, it uses the Wilson score (compare "YAGO3: A Knowledge Base from Multilingual Wikipedias", page 7). On page 5, Section 3.3, the reader might think that Le Monde is an application of YAGO. The actual application was analyzing the newspaper Le Monde using YAGO.

A major issue with the paper is the omission of one major automatically constructed knowledge base: DBpedia. It is only mentioned once as an application of YAGO. DBpedia was one of the early automatically constructed knowledge bases, and created a big community. Furthermore, the paper should describe Wikidata, or mention why it was not included in the survey.

Finally, the article should mention some future research directions for automatically constructed knowledge bases. This would round off the survey in the eyes of the reader and might motivate some researchers to investigate open problems. The survey needs to contribute more new information, as in its current form it mainly summarizes the papers of the treated knowledge bases.


Minor comments:
- p. 4, "subsequent extractor" and "follow-up extractor" seems redundant
- p. 7, "... because only pattern-based information extraction often prevents deep knowledge acquisition" probably means "... because applying only pattern-based ..."
- p. 8, in "If Probase has two candidates for detection ... that y_j \in {c_1, c_2} then ...", the word "that" seems odd. Also an explanation what a candidate is would be helpful.
- p. 8, mentioning the meaning of (x,y) would greatly help the reader. At the beginning I thought it meant "x isA y", but actually it means "y isA x"
- p. 8, the meaning of the |= symbol should be explained, or the symbol should not be used
- p. 8, bottom right: the paragraph could describe vertical merge better. It took me some time to see that the first element (x^i, y) is different from the subsequent ones.
- p. 11, "... for which a mapping was found before that is \mu(d) \not= \epsilon ...", sounds wrong to me. Even after reading that paragraph several times, I have not understood its meaning.
- p. 11, "For each Wikipage ... is mapped to ..." seems grammatically incorrect
- the algorithms should cite the original papers, as the quick reader might otherwise think that they were created by the authors of the survey.