Review Comment:
The paper tackles the problem of knowledge graph (KG) construction. It proposes planning and execution techniques to speed up KG construction pipelines specified using mapping rules in [R2]RML. The proposed method relies on partitioning the mapping rules so that the evaluation of the groups in the partition reduces the duplicated generation of RDF triples and maximizes the parallel execution of the mapping rules. The proposed techniques are implemented in MorphKGC, an RML-compliant engine; the behavior of MorphKGC is assessed on three existing benchmarks. The reported results suggest that the proposed methods can accelerate the execution of mapping rules such as those composing the studied benchmarks.
Overall, the paper is relatively well-written and presents an efficient solution to a relevant data management problem. The experimental results provide evidence of the benefits that planning the execution of the mapping rules brings to the process of KG construction. Moreover, the outcomes of the empirical evaluation put in perspective the improvements achieved by the proposed planning techniques. However, the current version of this work suffers from several issues that considerably reduce its value. First, the problem is not formally defined, the definitions are not well-formulated, and several relevant concepts are ill-defined. Second, the complexity and correctness of the proposed algorithms, and the conditions required for their performance and soundness, are not even mentioned. Third, the experimental study is conducted over a limited set of test cases, thus preventing the reproducibility of the reported outcomes. Finally, the state of the art is superficially analyzed, and the proposed techniques are not positioned with respect to similar data management techniques.
All these issues prevent a positive evaluation of the current version of the work. The recommendation is a major revision addressing all of the following comments.
Mathematical Pitfalls:
• Definition of equivalent triples maps. The conditions to be met for two triples maps to be equivalent are not formally stated. Note that two triples maps can produce equivalent results even if they are defined over two different logical sources; however, the informal definition presented on page 3 suggests that both triples maps must be defined over the same data source. Moreover, the property that "two equivalent term maps with different positions are not equal" is stated without justification; the notions of equivalent and equal term maps must be clearly differentiated. What is the complexity of deciding whether two triples maps are equivalent?
• "Because of RDF set semantics". An RDF document is formalized as a graph; please clarify and provide a reference for the RDF set semantics to which the authors refer.
• It is never defined when an RDF triple is generated from the evaluation of a triples map over a logical data source. Please formally state the result of evaluating a triples map.
• The definitions of position(.), type(.), value(.), and literaltype(.) incorrectly assume a domain consisting of a single element T. On the contrary, the domain in all these cases must be the set of term maps.
• Self-joins are used without any definition in the context of mapping rules.
• Definition 1. The concepts [R2]RML document, equivalent normalized [R2]RML document, and [R2]RML document without self-joins are used without any previous formal definition. Consequently, the description of a canonical [R2]RML document is mathematically incorrect.
• Definition 2. A partition of a set X is a set of subsets of X, but the relationship among the parts G_i of P is not defined. Also, please note the time complexity of enumerating all possible subsets of a set X.
• Definition 3. A template should first be defined in terms of the different structures it may take; only then can a prefix be defined.
• Definition 4. An invariant is wrongly defined: T is a term map while an invariant I is a string, so making I equal to T is a type mismatch. Also, the "if" conditions only state sufficient prerequisites; please provide necessary and sufficient conditions for the value of an invariant I.
• Definition 5. Term maps are treated as sets, even though a term map is never presented in this way.
• Definition 6. The definition of disjoint mapping rules relies on the notion of the triple set generated by a mapping rule, which has never been defined.
• Definition 7. It presents a property instead of a definition of disjoint mapping groups. Also, it relies on Definition 6 (which is incorrect) and assumes that a triples map is a set.
• Definition 8. It defines a maximal mapping partition of an [R2]RML document as the one with the largest number of mapping groups. This is exactly the partition in which every group is a singleton set, unless an additional, currently missing condition must be satisfied. What is the complexity of finding a maximal mapping partition?
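To make the comment on equivalent triples maps concrete, one possible semantic formalization, offered here purely as an illustration of what the authors could state (the evaluation function eval is itself undefined in the paper, as noted above), is:

```latex
% Hypothetical semantic definition of triples-map equivalence (not the paper's):
% TM_1 over logical source S_1 and TM_2 over logical source S_2 are equivalent
% iff their evaluations generate the same RDF triple set, regardless of whether
% S_1 and S_2 coincide.
TM_1 \equiv TM_2 \;\Longleftrightarrow\; \mathrm{eval}(TM_1, S_1) = \mathrm{eval}(TM_2, S_2)
```

A definition of this form would make explicit that equivalence does not require the two triples maps to share a logical source, and would separate equivalence (equality of generated triple sets) from equality (syntactic identity of the term maps, including their positions).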
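Likewise, the standard notion of set partition that Definition 2 should build on, stated here only as an illustration, makes explicit the missing relationship among the parts G_i:

```latex
% Standard set partition (illustrative; Definition 2 omits the relation among the G_i):
P = \{G_1, \dots, G_n\} \text{ is a partition of } X \;\Longleftrightarrow\;
\begin{cases}
  G_i \neq \emptyset & \text{for all } i,\\[2pt]
  G_i \cap G_j = \emptyset & \text{for all } i \neq j,\\[2pt]
  \bigcup_{i=1}^{n} G_i = X.
\end{cases}
```

Note also that the number of partitions of an n-element set is the Bell number B_n, which grows super-exponentially in n, so any argument relying on enumerating candidate partitions must address this cost.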
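Finally, regarding Definition 8: without an additional condition (e.g., the disjointness of the groups' generated triples), the partition into singletons trivially maximizes the number of groups, as the following sketch shows:

```latex
% For any set of mapping rules R, the singleton partition
P_{\mathrm{trivial}} = \{\, \{r\} \mid r \in R \,\}
% has |R| groups, the maximum attainable, so ``largest number of mapping groups''
% alone cannot characterize the intended maximal mapping partition.
```

The definition should therefore state which constraint rules out this trivial solution.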
Proposed Algorithms:
• Algorithm 1 simply replaces an object reference with the template definition of the corresponding parent triples map. Algorithms 2 and 3 use functions that are not defined.
• The heuristics implemented by Algorithm 3 are not defined, and because the problem solved by Algorithm 3 is never stated, it is impossible to demonstrate its correctness.
• The complexity of Algorithm 3 needs to be analyzed and demonstrated.
Empirical Evaluation:
• The empirical evaluation does not consider engines that also implement similar planning techniques, e.g., SDM-RDFizer 4.0 [1].
• The meaning of Table 1 is not clear, and the reported results have no connection with the rest of the outcomes presented in this section. If the experimental study was also conducted over all these datasets, please report those results; otherwise, eliminate this table, since it is misleading for the reader.
• The absolute values of Figures 4 and 5 should be reported.
• Provide a detailed discussion of the conditions to be met by a data integration system to benefit from the proposed techniques. For example, what happens when the joins between triples maps are not self-joins and the selectivity changes? The authors claim to have discussed the parameters that impact the execution of the KG construction process in previous work; however, only a few of them are considered in this study, thus reducing the reproducibility of the results in more general testbeds.
Related Work:
• This section simply describes tools instead of analyzing the data management techniques proposed by each of the existing approaches. Please provide a deeper analysis of the problems solved by the approaches reported in the literature and the pros and cons of their techniques, and position your techniques with respect to them.
Minor Comments:
• The benchmark Genomics – TIB is mentioned, but then the COSMIC testbed is used. Are they both the same? Do the authors refer to the benchmark available at [2] or to another benchmark?
[1] https://github.com/SDM-TIB/SDM-RDFizer
[2] https://figshare.com/articles/dataset/SDM-Genomic-Datasets/14838342/1