Construction of Knowledge Graphs: State and Challenges

Tracking #: 3386-4600

Authors: 
Marvin Hofer
Daniel Obraczka
Alieh Saeedi
Hanna Köpcke
Erhard Rahm

Responsible editor: 
Cogan Shimizu

Submission type: 
Survey Article
Abstract: 
With knowledge graphs (KGs) at the center of numerous applications such as recommender systems and question answering, the need for generalized pipelines to construct and continuously update such KGs is increasing. While the individual steps that are necessary to create KGs from unstructured (e.g. text) and structured data sources (e.g. databases) are mostly well-researched for their one-shot execution, their adoption for incremental KG updates and the interplay of the individual steps have hardly been investigated in a systematic manner so far. In this work, we first discuss the main graph models for KGs and introduce the major requirements for future KG construction pipelines. Next, we provide an overview of the necessary steps to build high-quality KGs, including cross-cutting topics such as metadata management, ontology development, and quality assurance. We then evaluate the state of the art of KG construction w.r.t. the introduced requirements for specific popular KGs as well as some recent tools and strategies for KG construction. Finally, we identify areas in need of further research and improvement.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 29/Mar/2023
Suggestion:
Reject
Review Comment:

# Summary
This paper covers the state and challenges of Knowledge Graph construction, focusing on the whole cycle for constructing Knowledge Graphs, e.g. data extraction, entity linking, construction of RDF or PGM, quality assurance, etc. The authors provide a set of requirements they consider, the main tasks and their proposed solutions, existing efforts and tools, and discuss open challenges.
# General comments

## Suitability as introductory text for readers to get started in the domain
The introduction is targeted at a wide audience in the beginning, but is too short to provide a comprehensive introduction to the domain. Moreover, the problem the authors address is already tackled by existing State-of-the-Art work which they do not include.

## Comprehensive and balanced presentation and coverage
This paper surveys the state and challenges of Knowledge Graph construction. However, it lacks a proper survey methodology describing how the authors conducted their search for relevant work and what they do or do not consider in this survey. A detailed methodology is extremely important for readers who start in the domain to understand which parts of the domain are covered and which are not.

## Clarity and readability
The paper is clearly written but sometimes lacks an explanation or a definition for new terms the authors introduce, which makes it harder for readers unfamiliar with the domain to understand what the authors present.

## Importance to the broader community
The authors highlight interesting challenges for Knowledge Graph construction, but missed relevant related work which already addresses some of these challenges.

## Provided data and its sustainability
I haven’t found any link to a data repository for this paper, so I assume there is none. This is understandable: since no experiments are performed, there is nothing to reproduce.

# Detailed comments

## Abstract
The abstract is clearly written and properly covers the context, the need, and what has been done in this paper to address that need. The abstract focuses on the need for incremental Knowledge Graph construction and the lack of investigation into the Knowledge Graph construction steps. However, both are already being addressed in the State of the Art [1, 2]. These relevant works are not considered by the survey.

## 1. Introduction
The introduction describes the problem the authors are facing and what they did to tackle it. An outline is provided for the reader to know what to expect in the next sections.

(C1) The introduction is too short to introduce the reader to the domain. I expected here at least a set of research questions the authors wanted to answer with this work, to situate the work and tell a reader who is unfamiliar with the domain what to expect from it.

(C2) The authors refer to several existing surveys, but do not include relevant surveys such as [3] where an in-depth analysis was performed on Knowledge Graph construction from (semi-)structured data. This survey covers several approaches and tools which this work is missing e.g. Function Ontology (FnO) [4], SPARQL Functions [5], FunUL [6], etc.

## X. Definitions
(C3) The paper would benefit from a separate Definitions section in which the authors define the terms they use throughout the paper.

(C4) For some terms, a definition is not provided in the paper, which is a problem for readers who are unfamiliar with them.

## X. Survey methodology
A survey paper requires a proper, comprehensive description of the methodology the authors followed to provide an overview of the domain. This way, readers know how the authors established their work and why certain works are included or excluded.

(C5) However, this survey paper lacks such a methodology, which makes it hard to understand why certain relevant works are not mentioned in the paper.

## 2. KG background and requirements for KG construction

(C6) The authors assume in this section that KGs always perform physical data integration, but that is not true. KGs can be materialized from existing data, or constructed through virtualization, which does not perform physical data integration but rather provides a virtualized access interface to the KG.

(C7) In contrast to what the authors state in this work, KGs are not a centralized graph-like representation of data; the power of KGs is rather that they can refer to other entities outside their own datasets. In this way, KGs are not a centralized graph-like representation, but a decentralized one.

(C8) I agree with the authors that data warehouses require manual intervention to evolve their schema, but so do KGs. KGs also need manual intervention to create ontologies, implement the mapping from data sources into a specific schema, and handle changes in the schema, e.g. foaf:name → schema:name. Using KGs does not currently solve this problem.

(C9) I would appreciate an extensive description of Figure 1. Beyond the short description, I would like to see what is interesting about Figure 1: what are the key points that the reader must remember when they only see the figure?

(C10) ‘Entities have an unique identifier’, but what about RDF blank nodes? They are only unique inside a KG. The impact of blank nodes is not considered in this work, while they are used a lot in KGs.
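The locality of blank nodes that C10 raises can be sketched in a few lines of plain Python (the `_:b` labels and the helper function are illustrative inventions, not any RDF library's actual API): each graph mints its own blank-node labels, so they cannot serve as cross-KG identifiers.

```python
import uuid

def new_blank_node() -> str:
    """Mint a blank-node label; it is only meaningful inside the graph that minted it."""
    return f"_:b{uuid.uuid4().hex}"

# Two KGs independently load the "same" Turtle snippet `[] ex:name "Alice" .`:
# each parse mints a fresh blank node for `[]`.
kg1_subject = new_blank_node()
kg2_subject = new_blank_node()

# Unlike IRIs, which identify an entity globally, the two labels never match,
# so the entity cannot be re-identified across KGs via its blank node.
print(kg1_subject == kg2_subject)  # False
```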

(C11) I lack the ‘why’ when the authors state the following: ‘common ontology relations such as is-a (subclass-of) and has-a relationships should be supported’. While I agree that is-a and has-a relationships are useful, I don’t find a strong argument in this work to support that.

(C12) ‘Temporal development of a KG’ is mentioned as a desired feature, which I fully agree with. However, throughout the paper I never found any related work on this topic, although it exists [7, 8, 9, 10].

(C13) The authors claim that RDF is a ‘data exchange format of the Semantic Web’, which is untrue. RDF is a framework/data model to represent data on the Web in a graph-like fashion [11]. Turtle, RDF/XML, N-Triples, etc. are all serialization formats of RDF. Later in the section, the authors correctly mention these serialization formats as formats, but RDF itself is not a format.

(C14) Use proper terminology: entities in RDF do not have a ‘URI-based identifier’, but an IRI as subject [11]. Moreover, the ‘head’ and ‘tail’ parts of a triple are the ‘subject’ and ‘object’ of an RDF triple. Subject, predicate, and object are all RDF terms.
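To make the terminology in C13 and C14 concrete, here is a small pure-Python sketch (the IRIs are made-up examples) of one abstract RDF triple and two of its serializations:

```python
# A triple in the RDF *data model*: three RDF terms called
# subject / predicate / object (not "head" and "tail").
triple = (
    "http://example.org/alice",  # subject (an IRI)
    "http://example.org/knows",  # predicate (an IRI)
    "http://example.org/bob",    # object (an IRI; could also be a literal)
)
s, p, o = triple

# The same abstract triple in two *serialization formats* of RDF:
ntriples = f"<{s}> <{p}> <{o}> ."                    # N-Triples
turtle = ("@prefix ex: <http://example.org/> .\n"
          "ex:alice ex:knows ex:bob .")              # Turtle

print(ntriples)
print(turtle)
```

The point of the sketch is that the triple exists independently of any surface syntax; N-Triples and Turtle are just two textual renderings of the same model.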

(C15) I miss a reference for ‘Another method is to use a combination of entity or property identifiers and referencing its hash-sum as key in another database to attach metadata information. The usage of such support constructs for metadata management generally increases complexity of the graph structure and queries and can possibly increase processing time’. It is hard to claim such things without a proper reference.

(C16) Several graph query languages are mentioned, but GraphQL [12] or GraphQL-LD [13] are not considered.

(C17) ‘But it also is hard to understand for users as the information of an entity is distributed over many triples’: I do not see this as a problem, because KGs were never meant to be human-friendly. They are machine-friendly, but also human-readable in RDF formats such as Turtle.

(C18) In the discussion, the authors claim that the PGM became really popular, but this claim lacks references to support it.

(C19) The authors require support for incremental KG updates, but not all KGs need this feature. It should be mentioned that this is heavily use-case dependent. Moreover, incremental KG updates are not straightforward: they require handling the creation, modification, and deletion of entities, which involves determining when to consider a change a creation, a modification, or a deletion.
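The difficulty raised at the end of C19 can be illustrated with a minimal sketch (the entity IDs and the dict-based structure are invented for illustration, not any specific KG system's API): an incremental update must first be classified into creations, modifications, and deletions by diffing two KG versions.

```python
def classify_changes(old: dict, new: dict):
    """Split an incremental update into created, modified, and deleted entities.

    `old` and `new` map entity IDs to property dicts; this structure is
    illustrative only.
    """
    created = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    modified = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return created, modified, deleted

old = {"Q1": {"name": "Alice"}, "Q2": {"name": "Bob"}}
new = {"Q1": {"name": "Alice Smith"}, "Q3": {"name": "Carol"}}
created, modified, deleted = classify_changes(old, new)
print(sorted(created), sorted(modified), sorted(deleted))  # ['Q3'] ['Q1'] ['Q2']
```

Even this toy version shows why the classification is delicate: deciding that an entity is "the same" across versions already presupposes stable identifiers, which blank nodes (see C10) do not provide.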

## 3. Construction Tasks

(C20) I fully agree that the tasks required on the data depend on the type of source input. Data from a relational database is handled differently than data from a Wikipedia page. The authors consider ‘unstructured’ and ‘structured’ data, but do not consider semi-structured data, e.g. JSON, CSV, XML, which most of the time lacks a proper schema as in a relational database, but is very common in KG construction.

(C21) The authors missed relevant related work in the Data Acquisition subsection e.g. LDES [14], LDF/TPF [15], OSTRICH [16], brTPF [17], SaGe [18], smart-KG [19], WiseKG[20], etc.

(C22) Add some examples and references to ‘Data format transformations have been especially addressed for structured data’. For example: Function Ontology (FnO) [4], SPARQL Functions [5], FunUL [6], etc.

(C23) The ‘Data Cleaning’ subsection stays too abstract to grasp the current state of this part of the domain. Examples and a more in-depth comparison of existing approaches should be added.

(C24) Relevant work was not included in the ‘Metadata Management’ subsection [21].

(C25) The authors claim that a centralized solution for metadata is preferred, but this claim is never supported, nor is it explained why it is preferred. This is useful information for readers who are unfamiliar with the field.

(C26) Definitions are lacking for introduced terms such as ‘EL algorithms’, ‘Neural relation extraction method’, ‘Multilayer Perceptron’, etc. A separate Definitions section would really improve the work.

(C27) I don’t understand the following paragraph; it may need to be rephrased: ‘However, there are some key differences not only in the characteristic modality of the data sources, but for example in the signals that lead to a linking decision. For example in entity linking already linked mentions of an entity x make it more likely that close encounters with a similar mention also lead to entity x, however in entity resolution in many cases the linkage scenario more closely follows a 1 − 1 matching between two data sources under the assumption of deduplicated or clean data sources.’

(C28) ‘The first techniques …’: I would expect some examples here, plus references to their publications, because it is unclear what the ‘first techniques’ are.

(C29) This claim needs to be supported by a reference: ‘While such methods can be incredibly useful to obtain relatively simple relations with high accuracy they are limited in terms of their recall or at least require a high degree of additional human involvement for feature engineering, designing of kernel functions or the discovery of relational patterns.’

(C30) I expect a survey stating the state and challenges to provide a summary of them, but the authors defer the ‘capabilities and challenges of these approaches’ to another work without providing further information.

(C31) In subsection ‘Entity Resolution and Fusion’ the authors claim the need for incremental KG construction and streaming-like processing. I fully agree with that, but they did not consider relevant related work that already exists addressing this claim [22, 23, 24, 25].

## 4. Overview of KG Construction Pipelines and Toolsets

(C32) A lot of relevant tools are missing for KG construction such as RMLMapper, RMLStreamer, CARML, Chimera, SPARQL-Generate, SPARQL-Anything, SDM-RDFizer, Morph-CSV, etc. A lot of them can be found in [3].

(C33) The authors selected several KG construction pipelines and tools. I agree that they are relevant and should be discussed. However, a proper survey methodology would have included the missing tools and approaches of C32. Currently, it is unclear how the authors made their selection and why the selected ones are relevant to the work.

(C34) DBpedia ‘mappings are fetched from the wiki API’. I would at least include a link to this API or explain what the API is since this may not be known to readers who are unfamiliar with the domain.

# Typos
p4: ‘JSONLD’: JSON-LD
p4: ‘Additionally, it is desirable to reflect’: can be more concise.
p6: ‘But it also is hard to understand’: typo, ‘it is also hard to’
p6: This paragraph is unclear to me, needs rephrasing: ‘In the PGM we can have two equally named but independently addressable relations between two entities, both with individually resolvable edge properties and the issue of interference. However, in RDF-Star, triples (relations) always identify based on their comprised elements, and it is not possible to attach distinguishable sets of additional data to equally named relations without overlapping or utilizing support constructs [46]. The PGM has become increasingly popular for advanced database and network applications (e.g., to analyse social networks) but its limited ontology support has so far hindered its broader adoption for KGs.’
p7: ‘desiderata’
p10: wrong acronym: API != ‘access interfaces’ but ‘Application Programming Interface’
p11: rephrase: ‘To make best use of metadata for KG construction asks for the use’
p16: conciseness: ‘it might be worthwhile however to’.
p16: missing commas: ‘To address these shortcomings, statistical’ & ‘lexical, syntactic, and semantic features’.

# References
[1] Continuous generation of versioned collections’ members with RML and LDES, Van Assche et al.
[2] Knowledge Graph Lifecycle: Building and Maintaining Knowledge Graphs, Umutcan S. et al.
[3] Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review, Van Assche et al.
[4] Implementation-independent function reuse, De Meester B. et al.
[5] SPARQL Query Language for RDF: Recommendation, Prud’hommeaux E. et al.
[6] FunUL: A Method to Incorporate Functions into Uplift Mapping Languages, Crotti Junior A. et al.
[7] Continuous generation of versioned collections’ members with RML and LDES, Van Assche et al.
[8] Publishing base registries as Linked Data Event Streams, Van Lancker D. et al.
[9] Linked Data Event Streams in Solid containers, Slabbinck W. et al.
[10] Publishing cultural heritage collections of Ghent with Linked Data Event Streams, Van de Vyvere B. et al.
[11] https://www.w3.org/TR/rdf11-concepts/
[12] https://graphql.org/
[13] GraphQL LD: Linked Data Querying with GraphQL, Taelman R. et al.
[14] Publishing base registries as Linked Data Event Streams, Van Lancker D. et al.
[15] Triple Pattern Fragments: a low-cost knowledge graph interface for the Web, Verborgh R. et al.
[16] Triple Storage for Random-Access Versioned Querying of RDF Archives, Taelman R. et al.
[17] Bindings-Restricted Triple Pattern Fragments, Hartig O. et al.
[18] SaGe: Web Preemption for Public SPARQL Query Services, Minier T. et al.
[19] SMART-KG: Hybrid Shipping for SPARQL Querying on the Web, Azzam A. et al.
[20] WiseKG: Balanced Access to Web Knowledge Graphs, Azzam A. et al.
[21] Detailed Provenance Capture of Data Processing, De Meester B. et al.
[22] RMLStreamer-SISO: an RDF stream generator from streaming heterogeneous data, SM Oo et al.
[23] Parallel RDF generation from heterogeneous big data, Haesendonck G. et al.
[24] A SPARQL extension for generating RDF from heterogeneous formats, Lefrancois M. et al.
[25] Hierarchical pattern matching for anomaly detection in time series, SV Hautte et al.

# Decision
Since this survey article misses a lot of relevant work from the State of the Art and lacks a comprehensive survey methodology describing how the authors collected relevant work, I propose rejection: this paper needs to be reworked extensively to solve the problems I highlighted in this review.

Review #2
By Umutcan Serles submitted on 23/Apr/2023
Suggestion:
Major Revision
Review Comment:

The paper provides an overview of different tasks in a knowledge graph lifecycle and a qualitative review of various knowledge graph construction tools and knowledge graph-specific approaches from various dimensions. The survey can be placed in a part of the landscape of surveys that is not overly saturated, in the sense that it views knowledge graph construction as a lifecycle and not a one-shot event. It focuses on the holistic approach and not on a specific task particularly. The discussion of identified challenges is very valuable.

Nevertheless, the paper still falls short in some aspects and would benefit from a round of improvement. I believe the individual comments are rather minor, but they pile up to be a major revision.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

This aspect is perhaps the biggest strength of the paper. In fact, more than half of the survey (excluding the references) is dedicated to an introductory text about different tasks in the knowledge construction lifecycle. Different tasks are covered most of the time with the right amount of detail, and the explanations are understandable for interested parties at different stages of their careers.

Some critical points regarding this aspect:
On page 3 the authors claim that they make a “more inclusive” definition of knowledge graphs, but to my understanding this definition is mostly a rephrase of the one from Ehrlinger et al.
On page 6 lines 10-12, the authors state that the lack of ontology support in PGMs is what has hindered broader adoption of this graph paradigm for KGs. I do not see a citation to a study here, therefore this statement is only anecdotal. I would “relax” the tone of the statement a bit. There can be many reasons for PGM solutions to stay behind RDF (if that is the case), starting with RDF having been around for a much longer time (even before the term knowledge graph was a ‘daily’ term) and standing on the shoulders of a huge standardization effort.
On page 7 lines 3-4, the authors state that correctness is about entities having unique identifiers and not having duplicates; I am not sure such a statement is enough to characterize correctness. Having an entity exactly once is more about conciseness or succinctness; having a correct statement multiple times does not make it incorrect. Correctness is about (a) whether the statements in a knowledge graph comply with a given formal specification and (b) whether the statements in a knowledge graph describe their domain correctly (real-world correctness).
On page 12 line 15 (a bit nitpicking), I think “transforming a source into an *equivalent* ontology” is too strong a statement, because there can be some abstractions during this transformation.
The authors state in Table 3 that Wikidata has an RDF model, but this is not the case. Wikidata has a custom data model that can be mapped to RDF to a large extent. Also, there is no explanation of what RDB stands for there. Also, Wikidata is mainly crowdsourced, but bots are also involved for semi-automated curation.

(2) How comprehensive and how balanced is the presentation and coverage.

This point is rather hard to judge. First of all, the survey methodology is not clearly explained, except for a vague mention of peer-reviewed papers being used; I cannot tell what the exact criteria were. A section about the survey methodology would improve the paper a lot.
Moreover, the related work section is almost non-existent, except for a paragraph in the introduction. There are surely more relevant surveys; see [1] for some examples. The authors should also take a look at books published in recent years about knowledge graphs, as they usually also contain surveys about the covered aspects.

It is a bit contradictory that the authors emphasize having a high-quality KG as a major point, but then cover the quality assurance task very superficially compared to other tasks.
I think my biggest concern about the paper is that the authors list some requirements but (a) it is not clear how these requirements were derived and (b) they are never explicitly mentioned again anywhere in the survey. The authors claim, before covering some knowledge graphs and more generic tools, that they analyze the approaches according to the tasks and requirements, but an explicit analysis according to the requirements is never done. It is possible to get some sense of such an analysis from the discussion, but I believe this could be done more explicitly.

(3) Readability and clarity of the presentation.

The paper is in general well organized and easy to understand. It would still benefit from proofreading. Some examples:
Page 1 line 21: requirement*s*
Page 7 line 6-7: the sentence does not make sense
Page 10 line 30: RDF Mapping Language is the correct way to write it
Page 11 line 14 wrong spacing after parentheses
Page 12 line 22: from from
Page 21 line 29: Polishingh
Page 22 line 43: wrong use of however
Page 24 line 28: one extra “criteria”

Figure 2 provides an overview of the construction steps, but the following titles of the subsections do not match the names in this figure one-to-one.

(4) Importance of the covered material to the broader Semantic Web community. Please also assess the data file provided by the authors under “Long-term stable URL for resources”.

The material certainly has great potential to be an important resource for the broader Semantic Web community, for the reasons I mentioned at the beginning of my review. My only point would be to clarify whether the quantitative statistics about the number of entities and relations are from the cited papers or were obtained from somewhere else. I can imagine there were some changes in the number of entities in NELL between 2010 and 2023.

[1] Hogan, A. (2022). Knowledge Graphs: A Guided Tour. In International Research School in Artificial Intelligence in Bergen (AIB 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

Review #3
By Matthias Sesboüé submitted on 05/May/2023
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

## General comments

It is an excellent paper with valuable insights.
I find very appropriate the consideration of incremental evolution of KGs.
The paper's conclusions for future research focus are enlightening and aligned with the various aspects discussed in the paper.

The authors adhere to the definition of KG: an ontology + the data.
While it is not explicitly written, the various aspects discussed validate this definition.
The paper would benefit from discussing both components of a KG and how they differ and are similar.
It would enable a better understanding of the different steps discussed and which parts they are targeting.
For instance, what is the difference between ontology learning and knowledge extraction?

While "ontology" is used multiple times, no definition is given.

The paper presents a KG as a centralised database, or at least the different discussions give this impression.
The KG is presented as a physical database aggregating different sources.
A KG is more of a conceptual idea; its realisation can take different forms.
Hence, in the parts introducing the two main graph models, there is no mention of the possibility of using both to implement the same KG.
In the same line of thought, in the section discussing the KG characteristics, there is no mention of KGs implemented as a virtual layer.
This propagates the idea that a KG is a centralised dataset aggregating different sources.

Hence, the paper would benefit from clarifying the distinctions between the conceptual idea of a KG, the graph structures used for its implementation, and the related technologies.

## Focus on the evaluation dimensions

(1) Suitability as an introductory text, targeted at researchers, PhD students, or practitioners to get started on the covered topic.
Very good

(2) How comprehensive and how balanced is the presentation and coverage?
Good

(3) Readability and clarity of the presentation.
Very good

(4) Importance of the covered material to the broader Semantic Web community.
Highly relevant

## Comments on specific sections

2.2 and Table 1
RDF is an exchange format, and the paper mentions serialisations in the “exchange format” row of the table.
In this section, what are called “exchange formats” are serialisations. There needs to be more clarity between the two concepts.

2.2 --> Discussion
I would add some mention of the following:
- The initial purposes of each model
- The possibility of mixing both approaches

3.1.4 Data Cleaning
For clarity, this section might benefit from a reorganisation. Separate structural vs semantic cleaning, for instance.

3.5.1 Incremental Entity Resolution
While the blocking and matching phases make sense, I needed clarification on what the optional clustering step brings. This section needs to be clarified.

3.6 Quality Assurance
Could benefit from a parallel with ontology learning evaluation techniques (golden standard-based, application-based, data-driven-based, human evaluation).

3.7 KG completion
Would benefit from a short sentence defining what is meant by internal vs external methods (page 20, lines 20-22)

3.7.1 Type Completion
page 20, lines 27-28, is it true that there "is no universally agreed upon definition of the PGM,..."?

Section 4.1 would benefit from being split into two subsections.

4.1 page 24, lines 24-25: “Some tools use machine learning approaches for extraction (AI-KG, CovidGraph, dstlr, SLOGERT), while NELL uses a supervised approach to find new patterns”.
This sentence needs to be clarified, since supervised approaches are part of machine learning ones.

## Errata

Though I am not a native English speaker, there are sentences or passages that I did not understand:
- 3.5.1 page 17, line 37 "The input is now the current ..." I think the "is now" is wrong there. No order is explicitly mentioned before.
- 2.2 Discussion page 6, lines 7-8: "In the PGM, we can have two equally named but independently addressable relations between two entities, both with individually resolvable edge properties and the issue of interference." I feel that "the issue of interference" comes out of nowhere and is unclear.
- 3.4.2 page 15, lines 35-37, those sentences could be clearer English-wise.

Also, for the paper quality, I am just pointing out some typos:
- 3.2, page 11, line 14, there is a rogue closing bracket.
- 3.4.2, page 16, line 2, "with its type" (not it's)