Review Comment:
In this article, the authors present the IncRML solution: an algorithm and implementation that apply the principles of Change Data Capture to adjust the operation of a knowledge graph construction process based on RML. The approach helps reduce the processing and storage burden of data inherent in the process by only signaling the necessary modifications to be made to the knowledge graph through a data stream encoded using the LDES vocabulary.
Overall assessment
1. Strengths:
- The IncRML proposal addresses an important challenge in the KGC community, which has not been solved in a generic way so far: managing the update of an existing knowledge graph by focusing on targeted modifications (addition, deletion, modification of entities) while using a declarative processing chain (i.e., RML) and semantic web standards (RML, FnO, LDES, SPARQL, etc.).
- Providing an implementation allows the community to validate the proposal and potentially contribute to its improvement.
- The evaluation of the proposal was conducted on test data and real datasets, which provides a good idea of the solution's applicability, particularly through the analysis of performance factors (e.g., data refresh rate, nature of modifications).
- Although the solution is not currently ready for production use out of the box (e.g., mechanisms for orchestrating IncRML need further discussion, resulting LDES data needs to be streamed and managed by graph stores), the explanations provided give a clear idea of the remaining developments to be done within and around IncRML.
- The authors effectively present the technical foundations used in the subsequent sections through the Related Work and Background sections.
2. Weaknesses:
- Despite the mentioned strengths of the proposal, the paper suffers from its form (e.g., burying key explanations in long paragraphs, redundant explanations, inconsistent notations), making the overall reading laborious and potentially limiting the adoption of the approach by individuals with a development and system architecture background. In my opinion, this is the essential adjustment that needs to be made to the article, which can be easily achieved with some courage from the authors to restructure and make necessary cuts in the text.
- Partly related to the previous point on formulation issues, the discussion of the triggering conditions for IncRML seems misplaced, raising doubts about the fundamental change detection mechanism in the proposed approach. The use of the term "advertisement" appears to be part of the problem (being more specific with "signaling" or "detection" could help). The tables and figures (especially Figure 3, which lacks details on the technical ecosystem, and Table 3, which is disconnected from the rest of the explanations) could be improved and integrated, which would easily resolve this issue.
- In subsection 2.1 of the Related Work section, the authors only provide a list of tool names without indicating their function, strengths, or weaknesses. Beyond the fact that a related work section should provide a minimum level of critical analysis of the mentioned items, it gives the impression that the authors are merely justifying the use of RML without further elaboration.
Questions & remarks:
* Could you elaborate more on how IncRML behaves when we don't have the first data frames of a given dataset (i.e. before any modification)?
* Did you look at how IncRML behaves if the
* Could you elaborate on why you repeatedly emphasize that RMLMapper has "100% coverage for all the test cases"? It seems that you want to say that IncRML is operational with RMLMapper (and thereby useful to practitioners) without really saying it.
* §4 starts with an approach overview, then categorizes advertisement strategies and the principles of implicit/explicit change detection in two separate subsections. This seems a very natural way of breaking down the challenges IncRML tries to address. However, §4.3 suggests that the core of IncRML rests on implicit changes and the implementation of stateful processing (p. 9, lines 11-27). Could you make clearer, starting in §4.1 & §4.2, what the fundamental principles of the IncRML framework are and how the "advertisement strategies" compose? It seems to me that this could be achieved by reorganizing the order of the subsections, reworking Table 1 with details from §4.2 & §4.3, and explaining from the beginning whether the algorithms from Figure 2 run in parallel and whether they have branching or common parts.
* Figure 3 seems to be a very (perhaps too) high-level view of the IncRML proposal to convey a concrete sense of the technical solution. For instance, should we run multiple instances of this pseudo pipeline if we have to deal with multiple data sources? Is the CDC feature part of the framework or could it be an external component? Is the CDC block triggering the KGC step? What kind of component consumes the "RDF Incremental" data? What (kind of) component orchestrates the execution of IncRML, and at what (potential) frequency? It could be interesting to bring Figure 3 closer to SysML / UML / BPMN / flow diagram standards. It could also be nice to show where the algorithms from Figure 2 live within such a pipeline.
* Missing info: p. 11, line 17: footnote 10 is given (which is good) without further explanation of what it is about (i.e. context). It could be interesting to give this URL earlier in the article and to say that the IncRML core features are available as open source. Further, as IDLabFunctions.java may evolve over time, it could be interesting to share this resource as a code snippet for a more stable reference from the article.
* Subsection 5.4 (p. 14, line 35): could you elaborate more on the usefulness of generating "3 named graphs"?
* Explaining the IncRML approach through the example of Table 3 could be more impactful with minor modifications of the table. For example, the table could become a figure showing the initial data (part a.) aligned with the changed data (part b.) within a flow diagram, with action tags (e.g. implicitUpdate) highlighting the modified age values (46 & 40), the deletion of the row with ID #3 and the addition of the row with ID #4. Figure 3 could serve as a basis for the flow diagram. The LDES structure from Figure 1 could also serve as a basis for explaining how updates materialize, instead of having a very generic Figure 1.
* Subsection title: for §7.4, it seems to me that "Lessons learned" would be a better title than "Discussion", as analyses of the results are already provided in each part of §7.2 and §7.3. In the same line of thought, it could be nice to summarize the capabilities, performance gains, strengths & weaknesses of the whole evaluation process in a single table.
* Missing info/clarification: in the §8 conclusion you very briefly suggest future work in terms of R&D axes, although you mentioned limitations of your approach in §4.1 (i.e. "an additional step in the pipeline [...]") and have many lessons learned in §7.4. It could be interesting to elaborate in more detail in §8 on the next steps needed to make IncRML production-ready and suitable as a component of a KG-based platform.
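To make the Table 3 suggestion above more concrete, here is a minimal sketch of the kind of walkthrough I have in mind. It is not the authors' implementation: the function name, the snapshot layout, the record IDs attached to the updated ages and the earlier age values are all assumptions of mine, chosen only so that the output shows two updates (ages 46 and 40), one deletion (ID 3) and one addition (ID 4).

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, str]  # (action, record ID)


def detect_implicit_changes(previous: Dict[str, Row],
                            current: Dict[str, Row]) -> List[Change]:
    """Compare two snapshots keyed by record ID and list the implied changes.

    Hypothetical helper for illustration; it only mirrors the idea of
    stateful, implicit change detection discussed in the review.
    """
    changes: List[Change] = []
    # Records present now: either new or (possibly) modified.
    for record_id, row in current.items():
        if record_id not in previous:
            changes.append(("create", record_id))
        elif previous[record_id] != row:
            changes.append(("update", record_id))
    # Records that were present before but are now missing.
    for record_id in previous:
        if record_id not in current:
            changes.append(("delete", record_id))
    return changes


# Hypothetical snapshots: IDs and earlier ages are assumed, not taken from Table 3.
before = {"1": {"age": 45}, "2": {"age": 39}, "3": {"age": 30}}
after = {"1": {"age": 46}, "2": {"age": 40}, "4": {"age": 25}}

for action, record_id in detect_implicit_changes(before, after):
    print(action, record_id)
```

A figure that aligns the two snapshots and labels each row with the resulting action would convey the same information at a glance.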
Minor remarks:
* Listings are difficult to read in a black and white printed version of the article.
* Typo: p. 10, line 3 in "If not, an created member".
* Typo: p. 10, line 18 in "Section 4.3; and"
* Typo: p. 10, line 22, "should change" instead of "have changed"?
* Typo: p. 10, line 23 in "TriplesMap8.)"
* Formatting issue: p. 10, footnote 8, on the "as:" prefix
* Listing 4: it could be interesting to highlight the "idlab-fn:implicitUpdate" term.
* Typo: p. 13, line 2, reference [5] was already defined earlier in the text and, by the way, is repeated again later.
* Listing 5: it would be nice for you to improve the readability of the listing by inserting blank lines between code blocks and more explanations as code comments (e.g. the explanations from the listing's caption).
* Formulation unclear/confusing: p. 16, line 40, in "may achieve, might be".
* Table 4: the table might be unnecessary (e.g. it could be shown only in the code repo) or, at least, is not informative or not well formatted (e.g. "purpose" could be symbols defined in the table's caption, "ID" could hold the availability/advertisement/type in the file name, a \hline could split the table between each availability category).
* Subsection title: for §6.1.3 (and generally) it might be interesting to use the term "dataset" instead of "use case".
* Tables 6-8: it could be nice to explain that underlined numbers mean "lowest observed value", and also to remind the reader that, in the present cases, "the lower the better".
* Tables 6-8: it could be nice to add columns with an all/change ratio or percentage for quick analysis of the results.
* Formulation issue: p. 20, line 31 (and generally), it is preferable to argue with quantitative values instead of solely using statements such as "a slight increase" => "a slight increase (xxx% on average)" or "a xxx% increase on average, which we consider as minimal with respect to ...". This is found again in §8, line 12, where you write "We show how our approach reduces [...]" instead of a more contrasted formulation like "To achieve this we did ..., which, through an evaluation process based on 5 representative datasets, showed that our approach on average reduces [...] by a factor xxx% [...]".
* Formulation unclear: p. 21, line 50, writing "We observe an increase in the amount of [...]" would make the paragraph easier to read.
* Typo: p. 25, line 50 in "the additional the metadata"
* Typo: p. 26, line 43 in "was deleted": is a "thus" missing?
* Formulation unclear/inconsistent: p. 26, lines 45 & 47 + p. 27, line 43, and generally, it could be nice to use the same way of writing performance gains across the whole article (i.e. instead of "4.42 times faster", then "factor 3.24" without positive/negative indication).
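As an illustration of the last remark and of the ratio columns suggested for Tables 6-8, one possible convention is to always report a factor together with a signed relative change. The sketch below is only a suggestion: the function names are mine and the measurements are hypothetical, not values from the article.

```python
def speedup_factor(baseline: float, incremental: float) -> float:
    """Return how many times smaller the incremental value is than the baseline."""
    if incremental <= 0:
        raise ValueError("incremental measurement must be positive")
    return baseline / incremental


def relative_reduction_pct(baseline: float, incremental: float) -> float:
    """Return the reduction relative to the baseline, in percent.

    A negative result means the incremental run was worse than the baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline measurement must be positive")
    return 100.0 * (baseline - incremental) / baseline


def format_gain(baseline: float, incremental: float) -> str:
    """Render both figures in one fixed, comparable format."""
    factor = speedup_factor(baseline, incremental)
    reduction = relative_reduction_pct(baseline, incremental)
    return f"{factor:.2f}x faster ({reduction:.1f}% less)"


# Hypothetical measurements (e.g. seconds for a full run vs. a changes-only run).
print(format_gain(100.0, 25.0))
```

Reporting every gain in one such format would remove the ambiguity between "times faster" and an unsigned "factor".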