Review Comment:
# Summary
This paper covers the state and challenges of Knowledge Graph construction, focusing on the whole construction cycle, e.g. data extraction, entity linking, construction of RDF or PGM, quality assurance, etc. The authors provide the set of requirements they consider, the main tasks and their proposed solutions, and existing efforts and tools, and they discuss open challenges.
# General comments
## Suitability as introductory text for readers to get started in the domain
The introduction is targeted at a wide audience at first, but it is too short to give a comprehensive introduction to the domain. Moreover, the problem the authors address is already tackled in existing State of the Art work, which they do not include.
## Comprehensive and balanced presentation and coverage
This paper covers the state and challenges of Knowledge Graph construction as a survey. However, it lacks a proper survey methodology describing how the authors conducted their search for relevant work and what they do or do not consider in this survey. A detailed methodology is extremely important for readers who are new to the domain, so that they understand which parts of the domain are covered and which are not.
## Clarity and readability
The paper is clearly written but sometimes lacks an explanation or definition of new terms the authors introduce, which makes it harder for readers unfamiliar with the domain to understand what the authors present.
## Importance to the broader community
The authors highlight interesting challenges for Knowledge Graph construction, but miss relevant related work which already addresses some of these challenges.
## Provided data and its sustainability
I haven’t found any link to any data repository for this paper, so I assume there is none. This is understandable as there are no experiments performed so they do not have to be reproduced.
# Detailed comments
## Abstract
The abstract is clearly written and properly conveys the context, the need, and what has been done in this paper to address the need. The abstract focuses on the need for incremental Knowledge Graph construction and the lack of investigation into the Knowledge Graph construction steps. However, both are already addressed in the State of the Art [1, 2]. These relevant works are not considered by the survey.
## 1. Introduction
The introduction describes the problem the authors are facing and what they did to tackle it. An outline is provided for the reader to know what to expect in the next sections.
(C1) The introduction is rather short for introducing a reader to the domain. I expected at least a set of research questions that the authors want to answer with this work, to situate the work and to tell a reader who is unfamiliar with the domain what to expect from it.
(C2) The authors refer to several existing surveys, but do not include relevant surveys such as [3] where an in-depth analysis was performed on Knowledge Graph construction from (semi-)structured data. This survey covers several approaches and tools which this work is missing e.g. Function Ontology (FnO) [4], SPARQL Functions [5], FunUL [6], etc.
## X. Definitions
(C3) The paper would benefit from a separate Definitions section in which the authors define the terms they use throughout the paper.
(C4) For some terms, no definition is provided in the paper, which is a problem for readers who are unfamiliar with them.
## X. Survey methodology
A survey paper requires a proper, comprehensive methodology that the authors followed to provide an overview of the domain. This way, readers know how the authors established their work and why certain works are included or excluded.
(C5) However, this survey paper lacks such a methodology, which makes it hard to understand why certain relevant works are not mentioned in the paper.
## 2. KG background and requirements for KG construction
(C6) The authors assume in this section that KGs always perform a physical data integration, but that is not true. KGs can be materialized from existing data, or constructed through virtualization, which does not perform a physical data integration but rather provides a virtualized access interface to the KG.
(C7) In contrast to what the authors state in this work, KGs are not a centralized graph-like representation of data. The power of KGs is rather that they can refer to other entities outside their own datasets. This way, KGs are never a centralized graph-like representation, but a decentralized one.
(C8) I agree with the authors that data warehouses require manual intervention to evolve their schema, but so do KGs. KGs also need manual intervention to create ontologies for a KG, to implement the mapping from data sources into a specific schema, and to handle changes in the schema, e.g. foaf:name → schema:name. Using KGs does not currently solve this problem.
(C9) I would appreciate a more extensive description of Figure 1. Besides the short description, I would like to see what is interesting about Figure 1: what are the key points that the reader must remember when they only see the figure?
(C10) ‘Entities have an unique identifier’, but what about RDF blank nodes? They are only unique inside a KG. The impact of blank nodes is not considered in this work, while they are used a lot in KGs.
(C11) I miss the ‘why’ when the authors state the following: ‘common ontology relations such as is-a (subclass-of) and has-a relationships should be supported’. While I agree that is-a and has-a relationships are useful, I don’t find a strong argument in this work to support that.
(C12) ‘Temporal development of a KG’ is mentioned as a desired feature, which I fully agree with. Throughout the paper I never found any related work on this topic, although such work exists [7, 8, 9, 10].
(C13) The authors claim that RDF is a ‘data exchange format of the Semantic Web’, which is untrue. RDF is a framework/data model for representing data on the Web in a graph-like fashion [11]. Turtle, RDF/XML, N-Triples, etc. are all serialization formats of RDF. Later on in the section, the authors correctly refer to these serialization formats as formats, but RDF itself is not a format.
(C14) Use proper terminology: entities in RDF do not have a ‘URI-based identifier’, but have an IRI as subject [11]. Moreover, the ‘head’ and ‘tail’ parts of a triple are the ‘subject’ and ‘object’ of an RDF triple. Subject, Predicate and Object are all RDF Terms.
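The distinction drawn in C13 and C14 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Triple` type, the two serializer functions, and the example.org IRIs are hypothetical and not taken from the paper, and the Turtle-style output is deliberately simplified (no prefixes, no literal escaping). It shows the same abstract triples, each with a subject, predicate, and object, written out in two different serialization formats.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF triple in the abstract data model (illustrative type)."""
    subject: str    # IRI
    predicate: str  # IRI
    object: str     # IRI, or a plain literal value when is_literal is True
    is_literal: bool = False


def _term(value: str, is_literal: bool) -> str:
    # Literals are quoted, IRIs are wrapped in angle brackets.
    return f'"{value}"' if is_literal else f"<{value}>"


def to_ntriples(triples: list[Triple]) -> str:
    """Serialize as N-Triples: one complete triple per line."""
    return "\n".join(
        f"<{t.subject}> <{t.predicate}> {_term(t.object, t.is_literal)} ."
        for t in triples
    )


def to_turtle_like(triples: list[Triple]) -> str:
    """Serialize in a simplified Turtle style: group by subject with ';'."""
    by_subject: dict[str, list[Triple]] = {}
    for t in triples:
        by_subject.setdefault(t.subject, []).append(t)
    blocks = []
    for subject, group in by_subject.items():
        pairs = " ;\n    ".join(
            f"<{t.predicate}> {_term(t.object, t.is_literal)}" for t in group
        )
        blocks.append(f"<{subject}>\n    {pairs} .")
    return "\n".join(blocks)


# The graph itself is the set of triples; the strings below are only formats.
graph = [
    Triple("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice", True),
    Triple("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
]

print(to_ntriples(graph))
print(to_turtle_like(graph))
```

Both printed strings describe the same graph, which is why RDF should be presented as the data model and Turtle or N-Triples as its formats.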
(C15) I miss a reference for ‘Another method is to use a combination of entity or property identifiers and referencing its hash-sum as key in another database to attach metadata information. The usage of such support constructs for metadata management generally increases complexity of the graph structure and queries and can possibly increase processing time’. It is hard to claim such things without a proper reference.
(C16) Several graph query languages are mentioned, but GraphQL [12] or GraphQL-LD [13] are not considered.
(C17) ‘But it also is hard to understand for users as the information of an entity is distributed over many triples’: I do not see this as a problem because KGs were never meant to be human-friendly. They are machine-friendly, but also human readable in RDF formats such as Turtle.
(C18) The authors claim that PGM became really popular in the discussion, but this claim lacks a set of references to support this.
(C19) The authors require support for incremental KG updates, but not all KGs need this feature. I think it should be mentioned that this is heavily use-case dependent. Moreover, incremental KG updates are not straightforward, as they require handling the creation, modification, and deletion of entities, which involves determining when to consider a change a creation, modification, or deletion.
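The difficulty noted in C19 can be made concrete with a minimal sketch. It assumes, purely for illustration, that each source snapshot is a dictionary from an entity identifier to its attribute values; the function name `classify_changes` and the sample records are hypothetical. Even this simple key-based comparison shows the ambiguity: if an identifier changes between snapshots, the same real-world entity is reported as one deletion plus one creation rather than as a modification.

```python
from typing import Any

# Assumed snapshot shape: entity identifier -> attribute dictionary.
Snapshot = dict[str, dict[str, Any]]


def classify_changes(old: Snapshot, new: Snapshot) -> dict[str, set[str]]:
    """Classify entities as created, modified, or deleted by comparing keys."""
    old_ids, new_ids = set(old), set(new)
    return {
        "created": new_ids - old_ids,
        "deleted": old_ids - new_ids,
        # Present in both snapshots, but with different attribute values.
        "modified": {i for i in old_ids & new_ids if old[i] != new[i]},
    }


old = {
    "e1": {"name": "Alice"},
    "e2": {"name": "Bob"},
    "e3": {"name": "Carol"},
}
new = {
    "e1": {"name": "Alice"},          # unchanged
    "e2": {"name": "Robert"},         # same identifier, new value
    "e3-renamed": {"name": "Carol"},  # same entity, identifier changed
}

changes = classify_changes(old, new)
print(changes)
```

Whether the renamed entity should count as a modification is exactly the kind of decision that depends on the use case and on how stable the source identifiers are.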
## 3. Construction Tasks
(C20) I fully agree that the tasks required to process the data depend on the type of source input. Data from a relational database is handled differently than data from a Wikipedia page. The authors consider ‘unstructured’ and ‘structured’ data, but do not consider semi-structured data, e.g. JSON, CSV, and XML, which are very common in KG construction but most of the time lack a proper structured schema like that of a relational database.
(C21) The authors missed relevant related work in the Data Acquisition subsection, e.g. LDES [14], LDF/TPF [15], OSTRICH [16], brTPF [17], SaGe [18], smart-KG [19], WiseKG [20], etc.
(C22) Add some examples and references to ‘Data format transformations have been especially addressed for structured data’. For example: Function Ontology (FnO) [4], SPARQL Functions [5], FunUL [6], etc.
(C23) The ‘Data Cleaning’ subsection stays too abstract to grasp the current state of the domain on this part. Examples and a more in-depth comparison of existing approaches should be added.
(C24) Relevant work was not included in the ‘Metadata Management’ subsection [21].
(C25) The authors claim that a centralized solution for metadata is preferred, but this claim is never supported, nor is it explained why it is preferred. This is useful information for readers who are unfamiliar with the field.
(C26) Definitions are lacking for introduced terms such as ‘EL algorithms’, ‘Neural relation extraction method’, ‘Multilayer Perceptron’, etc. A separate Definitions section would really improve the work.
(C27) I don’t understand the following paragraph; it may need to be rephrased: ‘. However, there are some key differences not only in the characteristic modality of the data sources, but for example in the signals that lead to a linking decision. For example in entity linking already linked mentions of an entity x make it more likely that close encounters with a similar mention also lead to entity x, however in entity resolution in many cases the linkage scenario more closely follows a 1 − 1 matching between two data sources under the assumption of deduplicated or clean data sources.’
(C28) ‘The first techniques …’ I would expect here some examples + references to their publications because it is unclear what the ‘first techniques’ are.
(C29) This claim needs to be supported by a reference: ‘While such methods can be incredibly useful to obtain relatively simple relations with high accuracy they are limited in terms of their recall or at least require a high degree of additional human involvement for feature engineering, designing of kernel functions or the discovery of relational patterns.’
(C30) I expect a survey on the state and challenges of a domain to provide a summary of them, but the authors defer the ‘capabilities and challenges of these approaches’ to another work without providing further information.
(C31) In the subsection ‘Entity Resolution and Fusion’, the authors claim the need for incremental KG construction and streaming-like processing. I fully agree with that, but they did not consider relevant related work that already addresses this need [22, 23, 24, 25].
## 4. Overview of KG Construction Pipelines and Toolsets
(C32) A lot of relevant tools are missing for KG construction such as RMLMapper, RMLStreamer, CARML, Chimera, SPARQL-Generate, SPARQL-Anything, SDM-RDFizer, Morph-CSV, etc. A lot of them can be found in [3].
(C33) The authors selected several KG construction pipelines and tools. I agree that they are relevant and should be discussed. However, a proper survey methodology would have included the missing tools & approaches of C32. Currently, it is unclear how the authors made their selection and why the selected ones are relevant to the work.
(C34) DBpedia ‘mappings are fetched from the wiki API’. I would at least include a link to this API or explain what the API is since this may not be known to readers who are unfamiliar with the domain.
# Typos
p4: ‘JSONLD’: JSON-LD
p4: ‘Additionally, it is desirable to reflect’: can be more concise.
p6: ‘But it also is hard to understand’: typo, ‘it is also hard to’
p6: This paragraph is unclear to me, needs rephrasing: ‘In the PGM we can have two equally named but independently addressable relations between two entities, both with individually resolvable edge properties and the issue of interference. However, in RDF-Star, triples (relations) always identify based on their comprised elements, and it is not possible to attach distinguishable sets of additional data to equally named relations without overlapping or utilizing support constructs [46]. The PGM has become increasingly popular for advanced database and network applications (e.g., to analyse social networks) but its limited ontology support has so far hindered its broader adoption for KGs.’
p7: ‘desiderata’
p10: wrong acronym: API != ‘access interfaces’ but ‘Application Programming Interface’
p11: rephrase: ‘To make best use of metadata for KG construction asks for the use’
p16: conciseness: ‘it might be worthwhile however to’.
p16: missing commas: ‘To address these shortcomings, statistical’ & ‘lexical, syntactic, and semantic features’.
# References
[1] Continuous generation of versioned collections’ members with RML and LDES, Van Assche et al.
[2] Knowledge Graph Lifecycle: Building and Maintaining Knowledge Graphs, Umutcan S. et al.
[3] Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review, Van Assche et al.
[4] Implementation-independent function reuse, De Meester B. et al.
[5] SPARQL Query Language for RDF: Recommendation, Prud’hommeaux E. et al.
[6] FunUL: A Method to Incorporate Functions into Uplift Mapping Languages, Crotti Junior A. et al.
[7] Continuous generation of versioned collections’ members with RML and LDES, Van Assche et al.
[8] Publishing base registries as Linked Data Event Streams, Van Lancker D. et al.
[9] Linked Data Event Streams in Solid containers, Slabbinck W. et al.
[10] Publishing cultural heritage collections of Ghent with Linked Data Event Streams, Van de Vyvere B. et al.
[11] https://www.w3.org/TR/rdf11-concepts/
[12] https://graphql.org/
[13] GraphQL LD: Linked Data Querying with GraphQL, Taelman R. et al.
[14] Publishing base registries as Linked Data Event Streams, Van Lancker D. et al.
[15] Triple Pattern Fragments: a low-cost knowledge graph interface for the Web, Verborgh R. et al.
[16] Triple Storage for Random-Access Versioned Querying of RDF Archives, Taelman R. et al.
[17] Bindings-Restricted Triple Pattern Fragments, Hartig O. et al.
[18] SaGe: Web Preemption for Public SPARQL Query Services, Minier T. et al.
[19] SMART-KG: Hybrid Shipping for SPARQL Querying on the Web, Azzam A. et al.
[20] WiseKG: Balanced Access to Web Knowledge Graphs, Azzam A. et al.
[21] Detailed Provenance Capture of Data Processing, De Meester B. et al.
[22] RMLStreamer-SISO: an RDF stream generator from streaming heterogeneous data, SM Oo et al.
[23] Parallel RDF generation from heterogeneous big data, Haesendonck G. et al.
[24] A SPARQL extension for generating RDF from heterogeneous formats, Lefrancois M. et al.
[25] Hierarchical pattern matching for anomaly detection in time series, SV Hautte et al.
# Decision
Since this survey article misses a lot of relevant work from the State of the Art and lacks a comprehensive survey methodology on how the authors collected relevant work, I propose a rejection because this paper needs to be reworked extensively to solve the problems I highlighted in this review.