Review Comment:
General opinion: as "reviewer #2" of the first version of the paper, I consider that the proposed IncRML approach is promising and well-structured, and that in this new version of the paper, a large majority of the improvement points I mentioned have been positively addressed. I therefore recommend this paper for publication while suggesting a few more points for improvement to the authors (see below), which mainly consist of adding clarifications, details, and enhancing the writing flow.
Questions & remarks:
1. "2.2. Change Data Capture" : I suggest that you end the subsection by emphasizing that this paper focuses on the snapshot-based and timestamp-based approaches, namely in §4.3.
2. "2.5. Incremental mapping rules execution" and throughout the paper : you often use the expression "support heterogeneous data," which is a good thing in itself, but - given that the IncRML proposal is very interesting for operational contexts where the knowledge graph would be built on complex datasets - it seems to me that you should clarify that the experiments in the paper focus on mono-source heterogeneous data, and suggest (e.g., in the conclusion) ways to integrate IncRML in multi-source contexts. As mentioned in my previous review, this kind of clarification would facilitate the adoption of the IncRML proposal by teams of system architects / system integrators by giving them a clear view of the proposal's scope. Paper [1] addresses such multi-source situations and proposes an a posteriori reconciliation of data via an orchestration system; perhaps you could draw inspiration from such references to position IncRML.
3. "4.1. IncRML" and "Fig. 3" : I understand that you focus on CDC management and the production of a change log since it is the core of your research and proposal, but it seems to me that you should assist the reader just a little bit more by presenting the entire processing chain, i.e., by highlighting the LDES client and the triplestore. Since Fig. 3 is already quite dense, a written formulation such as "dynamic dataset --[FNO-based CDC]--> extracted changes --[RML-based change signaling]--> LDES artifacts --[LDES-based materialization]--> Triplestore" positioned right at the beginning of §4 would easily resolve the issue.
4. "5.2. KG generation with RML and CDC FnO functions": The phrase "A separate Triples Map is optionally used to [...]" is suggestive, while you make ample use of the proposed technique in the following sections; it seems useful to rephrase it with a wording such as "can be used, as exemplified later in Listing 5 for [...]"
5. "5.2. KG generation with RML and CDC FnO functions": The argument "but other ontologies can be used in the mappings as well" is open-ended; could you provide references for examples?
6. Listings 6 and 7: It could be useful to remind the reader that these artifacts are produced at each occurrence of an IncRML run.
7. "5.5. GTFS Madrid Benchmark extension": This development / section is a consequence of the evaluation step, hence it shouldn't be part of §5 but of §6.
8. "6.1.4. Ingestion": Although I know what Virtuoso is, I suggest that you add a reference to the tool in the sentence "We perform this measurement over a Virtuoso instance" for readers who are not familiar with it. Additionally, please provide the Virtuoso release version for the sake of reproducibility of the experimental setup.
9. "6.1.4. Ingestion": I strongly regret that the tool https://github.com/julianrojas87/incrml2sparql, which is a key element of the proposal and experimentation, is not better explained or highlighted. A process diagram describing the principle of the implemented algorithm is at a minimum desirable, both for the clarity of the discussion (the algorithm and the tool should have been presented as early as section 5) and because there is no guarantee of the longevity of this code repository.
10. "7.4. Ingestion": It could be useful to add details on how you overcame the query length limit (i.e., the --limit parameter in incrml2sparql and which value you found to be optimal). Also, what about using a dataset chunking strategy?
11. "8. Conclusion": It could be useful to add references, or a brief indication of how to proceed, for some of your future work claims in order to provide a sustainable direction, particularly in the sentences "as incremental generation has only been tackled recently" and "handling schema-level (ontology and mappings) changes efficiently is an important aspect to be considered."
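To illustrate the chunking strategy raised in remark 10, here is a minimal sketch of what I have in mind. It is only an illustration under my own assumptions: the function name `chunk_insert_queries`, the `max_chars` budget, and the N-Triples input are hypothetical, and nothing here is taken from incrml2sparql or from the paper.

```python
from typing import Iterable, Iterator, List

PREFIX = "INSERT DATA {\n"
SUFFIX = "}"


def chunk_insert_queries(triples: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield SPARQL INSERT DATA queries, each at most max_chars long.

    `triples` are N-Triples lines (hypothetical input); `max_chars` is an
    assumed per-query length budget, not a documented triplestore setting.
    """
    overhead = len(PREFIX) + len(SUFFIX)
    batch: List[str] = []
    size = overhead
    for triple in triples:
        line = triple.strip() + "\n"
        if overhead + len(line) > max_chars:
            raise ValueError("a single triple exceeds the query length budget")
        # Flush the current batch before adding this triple would overflow it.
        if batch and size + len(line) > max_chars:
            yield PREFIX + "".join(batch) + SUFFIX
            batch, size = [], overhead
        batch.append(line)
        size += len(line)
    # Emit whatever remains after the last triple.
    if batch:
        yield PREFIX + "".join(batch) + SUFFIX
```

Each generated query could then be sent to the triplestore separately, so the per-query length stays bounded regardless of the size of the change set. Whether this performs better than a single tuned limit value is precisely what I would like the authors to discuss.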
Minor remarks:
- Typo in "6.2. Evaluation setup": "to use theLDES Logical Target" => add a space before LDES.
- Missing caption in "Table 6" => add "Lower is better".
- Typo in "Table 12": "Total execution time (s" => add a closing parenthesis.
Refs:
- [1] "Designing NORIA: a Knowledge Graph-based Platform for Anomaly Detection and Incident Management in ICT Systems" (Tailhardat, et al. 2023) https://ceur-ws.org/Vol-3471/paper3.pdf