Improving the ShExML engine through a profiling methodology

Tracking #: 3680-4894

Authors: 
Herminio Garcia-Gonzalez

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Tool/System Report
Abstract: 
The ShExML language was born as a more user-friendly approach for knowledge graph construction. However, recent studies have highlighted that its companion engine suffers from serious performance issues. Thus, in this paper I undertake the optimisation of the engine by means of a profiling methodology. The improvements are then measured as part of a performance evaluation whose results are statistically analysed. Based on this analysis, the effectiveness of each proposed enhancement is discussed. As a direct result of this work, the ShExML engine offers a much more optimised version which can cope better with users' demands.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jakub Klimek submitted on 17/May/2024
Suggestion:
Reject
Review Comment:

The author presents a report on how they investigated bottlenecks in their software, the ShExML engine, through software performance profiling techniques, and how the identified bottlenecks were addressed in different versions of the software.
The ShExML engine seems to be currently the only implementation of the ShExML language for mapping data to RDF.
The implementation is hosted on GitHub.

The ShExML language itself is not introduced (not even briefly) in the paper, which makes it difficult for the reader to get into the right context, makes the paper seem not self-contained, and lowers its readability. I recommend that the author introduce the language at least briefly, stating what its inputs, key building blocks, and basic functionalities are. Judging by the citation counts of other ShExML papers, it is not so well known as, e.g., XML or RDF, that it would not need an introduction in the context of a report on a tool implementing it. This fact, and the statistics on the corresponding GitHub repository of the tool, also point to a so far limited impact of the tool, of which there is also no evidence or discussion in the paper.

An architecture of the ShExML engine is illustrated, but it is unclear which formalism (if any) is used to do so. Some of the symbols would suggest ArchiMate, but this is not stated in the paper. Here I would recommend using a well-established formalism (ArchiMate, UML, the C4 model) when presenting a software architecture.
The description of the architecture starts with lexical analyzers; however, since at this point it is unclear which data formats are supported as input, it is also unclear what the lexical analyzers actually analyze. Is it any text-based format?

Next, the RDF generation algorithm is introduced via a listing of two algorithms. Here, the word "shape" is used (p3 l38), but it is unclear what exactly is meant by that, since ShExML was not introduced. Also, for the first time in the paper, we can see that JSON and XML are probable input data formats, given that the JSONPath and XPath query languages are mentioned here.
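For readers unfamiliar with these query languages, the kind of selection implied here looks roughly like the following sketch. The films data, the field names, and the use of Jayway JsonPath together with the JDK's built-in XPath API are purely illustrative assumptions; the paper does not state which libraries the engine relies on.

    import java.io.StringReader
    import javax.xml.xpath.XPathFactory
    import org.xml.sax.InputSource
    import com.jayway.jsonpath.JsonPath // external library, used here only for illustration

    object QuerySketch extends App {
      // Hypothetical film records in the two input formats the reviewer infers (JSON and XML).
      val json = """{ "films": [ { "id": "1", "name": "Dunkirk" } ] }"""
      val xml  = """<films><film id="1"><name>Dunkirk</name></film></films>"""

      // A JSONPath query selecting every film name from the JSON document.
      val jsonNames: java.util.List[String] = JsonPath.read(json, "$.films[*].name")

      // The equivalent XPath query over the XML representation (first match as a string).
      val xmlName = XPathFactory.newInstance().newXPath()
        .evaluate("//film/name", new InputSource(new StringReader(xml)))

      println(s"$jsonNames / $xmlName")
    }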

In section 4, about the profiling of the software, a profiling methodology is mentioned; however, it is unclear whether it is some existing methodology, or whether the methodology is the rather informal paragraph itself, where the author states the usage of Java VisualVM and the process of running and monitoring the ShExML engine.
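A minimal sketch of what such a profiling run presumably looks like, assuming a deliberate startup delay so that Java VisualVM can attach to the JVM before the mapping starts; all names here are hypothetical and this is not the engine's actual code.

    object ProfilingHarness {
      def main(args: Array[String]): Unit = {
        // Pause so that a sampler/profiler such as Java VisualVM can be attached
        // to this JVM process before the workload of interest begins.
        Thread.sleep(20000)

        // Run the workload to be profiled (hypothetical entry point).
        val mappingRules = scala.io.Source.fromFile(args(0)).mkString
        runShExMLMapping(mappingRules)
      }

      // Placeholder standing in for the engine's actual mapping entry point.
      def runShExMLMapping(rules: String): Unit = ()
    }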

Next is an overview of four consecutive versions of the ShExML engine, along with an, again quite informal, description of what was fixed in each of them based on the findings from the software profiling.
Finally, the author provides a statistical evaluation of the various versions based on time spent on mapping various inputs to RDF.

Overall, the paper reads more like a development diary or a blog post than a journal paper. The author does not fulfil the requirements of a SWJ "Tools and systems report" type submission, as they discuss neither the impact of the tool nor its capabilities as a whole, and rather discuss minor optimizations across several versions of the tool. In addition, the language of the paper is informal in places, with imprecise expressions like "However, as this process can take some seconds to set up I included a general delay of 20 seconds in the main method of the ShExML engine in order to ensure the capture of the relevant data.", or "methods that were taking more time than what was deemed appropriate." without stating how "appropriate" is defined.
The paper would have to be redone and its contents changed too much to fit the requirements of a "Tools and systems report" in a major revision. Based on these arguments, I recommend rejecting the paper.

Minor issue: references are not correctly capitalized, e.g. "Why to Tie to a Single Data Mapping Language? Enabling a Transformation from ShExML to RML" instead of "Why to tie to a single data mapping language? enabling a transformation from shexml to rml".

Review #2
By Nuno Lopes submitted on 10/Jun/2024
Suggestion:
Minor Revision
Review Comment:

The paper describes the different optimisations applied to the ShExML engine throughout different releases. The changes are well described, albeit a bit too focused on implementation details. For instance, the engine architecture section links to specific files of the code; I believe the paper would benefit if these details were confined to the architecture diagram.

The biggest performance improvements were achieved by IO optimisations, such as avoiding multiple downloads and not using full file contents for id generation, and by changing external libraries. I would have preferred that these tasks had been done beforehand (included in the baseline) and that the paper focused only on the evaluation of algorithmic changes.
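For context, the download-related optimisation presumably boils down to memoising remote sources by URL within a single mapping run; a minimal sketch under that assumption (not the engine's actual implementation):

    import scala.collection.mutable

    // Minimal sketch of per-run source caching: each remote file is fetched at most
    // once and reused by every iterator that references it.
    class SourceCache {
      private val cache = mutable.Map.empty[String, String]

      def fetch(url: String): String =
        cache.getOrElseUpdate(url, scala.io.Source.fromURL(url).mkString)
    }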

Although it is not within the scope of this paper, it could benefit from a comparison with other engines; it would elevate the paper to also position the optimisations relative to other engines.

Some other minor details:
- The position of listings and algorithms: Listing 1 is presented before Algorithms 1 and 2 even though it is mentioned after them.
- Listing 1 is also shown before Listing 2, but Listing 2 is referred to first.
- Listing 3 does not provide any additional value and could be replaced with a short textual explanation.
- Line 40 of Algorithm 1 is missing the ST parameter.

Review #3
Anonymous submitted on 17/Jun/2024
Suggestion:
Major Revision
Review Comment:

The paper describes optimizations performed on the ShExML engine.

3. Engine
I think that a working example of how the engine works would be very helpful; instead of explaining details of the implementation, you could use this example to illustrate the engine's architecture. For example, in Listing 1 you have an example, but this is the first time we see this films file being mentioned; maybe one with the sample data would help. I would suggest starting with the working example, explaining the test data and mappings, and then how the engine works on this input.

4. Profiling
Could you add details on the performance issues in relation to CPU/RAM, etc.?

Instead of organising by version, you could list the improvements in a list, or maybe start with a summary and then the details. Something like:

0.3.3 File caching.
0.4.0 Filtering function. What is the actual change? withFilter instead of filter: why is one faster than the other? (See the sketch after this list.)
0.4.2 Unclear what the changes are.
0.5.1 Source data libraries. Could you explain why these libraries perform so differently? Could it be, for example, that the issue is the test data?
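On the filter/withFilter question above: in Scala, filter eagerly materialises an intermediate collection, whereas withFilter applies the predicate lazily during the subsequent map/flatMap/foreach and thus avoids that allocation. A small self-contained illustration (not taken from the engine's code):

    object FilterVsWithFilter extends App {
      val ids = (1 to 1000000).toList

      // filter builds a full intermediate list before map is applied.
      val eager = ids.filter(_ % 2 == 0).map(_ * 2)

      // withFilter is non-strict: the predicate is evaluated lazily while map runs,
      // so no intermediate collection is allocated.
      val fused = ids.withFilter(_ % 2 == 0).map(_ * 2)

      assert(eager == fused)
    }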

5. Evaluation
You could start with the methodology (the CPU/RAM, OS, etc.) before showing the results.

You say 1,000 entries for Films, mostly a flat hierarchy, but then how many entries are there for EHRI institutions? You say it is hierarchical, but maybe an example of the structure would help. What are the file sizes and the CPU/RAM consumption?

I miss a direct comparison to the SPARQL-Anything performance evaluation. You say in the introduction that the performance was worse than that of competitors, but then this is not discussed any more. Are your experiments comparable at all? Did you try running the experiment on the same data/mappings as them? With similar setups, etc.?

Review #4
By Ana Iglesias-Molina submitted on 26/Jun/2024
Suggestion:
Major Revision
Review Comment:

This paper presents the process of improving the ShExML processor in order to overcome its limitations regarding performance time for KG construction. It first presents the architecture of the engine, then identifies the bottlenecks, explains how they are solved, and performs an evaluation of the subsequent versions in which the bottlenecks are addressed, to showcase how performance improves across versions.

In general I find the paper a good contribution for the issue, and I especially appreciate the statistical analysis in the evaluation, which is often overlooked in this type of paper. All resources are available and properly published as GitHub releases and on Zenodo. However, there are several aspects in which the paper needs improvement, which I explain in detail below.

--- 2. Related work ---

I feel that, in a certain way, this section misses the point. While it highlights where the ShExML engine has been evaluated against other engines, it misses papers presenting optimizations for KGC in similar engines (see for instance [1-3]). Most of the papers presenting new or improved engines in this field come with a comparison against other engines; the selection made in this section is somewhat narrow, the criteria unclear, and the order confusing. For instance, the first and third paragraphs talk about engine comparisons, and the second and fourth about benchmarks; paper [5], which presents an engine evaluation, is cited elsewhere in the paper, but for some reason it is not relevant for this section? I would suggest reordering the section and enriching it with more similar papers, of which there are plenty. As a side note, the challenge at the KGCW workshop now has two editions: https://w3id.org/kg-construct/workshop/2024/challenge

[1] Arenas-Guerrero, J., Chaves-Fraga, D., Toledo, J., Pérez, M. S., & Corcho, O. (2024). Morph-KGC: Scalable Knowledge Graph Materialization with Mapping Partitions. Semantic Web, 15, 1–20.
[2] Iglesias, E., Jozashoori, S., & Vidal, M.-E. (2023). Scaling up Knowledge Graph Creation to Large and Heterogeneous Data Sources. Journal of Web Semantics, 75, 100755.
[3] Iglesias, E., Vidal, M.-E., Jozashoori, S., Collarana, D., & Chaves-Fraga, D. (2022). Empowering the SDM-RDFizer Tool for Scaling up to Complex Knowledge Graph Creation Pipelines. Semantic Web.

--- 3. ShExML engine algorithm ---

This section could use some improvements in terms of clarity of descriptions; the author takes too many concepts for granted and the section becomes hard to follow. For instance, concepts like the "pipes and filters architectural pattern" or ANTLR are not explained, and neither the text nor Figure 1 makes explicit what the inputs and outputs of the process are (the data, the mapping, both, or one or the other depending on the component?).
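For reference, the "pipes and filters" pattern mentioned here simply chains independent processing stages, each consuming the previous stage's output; a toy sketch with hypothetical stage names, not the engine's actual pipeline:

    object PipesAndFiltersSketch extends App {
      // Each stage is an independent function; the pipeline is their composition.
      def lex(mapping: String): List[String]        = mapping.split("\\s+").toList
      def parse(tokens: List[String]): List[String] = tokens.filter(_.nonEmpty) // placeholder stage
      def generate(ast: List[String]): String       = ast.mkString(" ")         // placeholder stage

      val pipeline: String => String = (lex _).andThen(parse).andThen(generate)

      println(pipeline("PREFIX : <http://example.com/>"))
    }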

Reaching Section 3.2, I realized that the concept of "shape" in terms of the ShExML mappings has not been explained. Maybe this paper could add a brief background section explaining the basics of the ShExML language (as many [R2]RML papers do) so that the reader can better understand the implications and components of the language.

I like the example in the listings; it really helps to understand how the engine processes data. I also think that adding another listing with the input JSON data file from the link mentioned in Listing 1 would be even better, so that the reader does not have to go look it up manually.

--- 4. Profiling the ShExML engine and performance improvements ---

Similarly to the previous section, this one could also improve the clarity of its descriptions, starting by explaining what a profiling methodology is and why it is suitable. But my major concern is with the presentation of the bottlenecks; I have mixed feelings. On the one hand, it makes sense to present the changes per version with respect to the evaluation presented in the next section. On the other hand, it is claimed throughout the paper that the improvements could help other engines improve. However, since the descriptions of the bottlenecks and solutions are encapsulated within the versions, they are harder to distinguish. This claim is fair and solid, and these bottlenecks may apply to other engines; that is why I would recommend rearranging the section and giving a proper description of each bottleneck and the proposed solution, linking at the end to the version in which it is addressed. This way the paper can prove more useful in this regard for other practitioners, because right now it is more "engineerish" and focused on the versions rather than on the problems themselves. I believe this rearrangement and change of focus could also enrich how the results are discussed and presented.

--- 5. Evaluation ---

In general the evaluation seems solid; I only have a few remarks.
- It would be useful to include the reasons for choosing the statistical tests presented in the paper; they are not that common, so it is worth explaining the particularities that make them the most suitable for this case.
- Has the author considered measuring not only the execution time, but also the CPU and RAM usage?
- I also miss a general reflection on which bottlenecks were the most critical, or the most beneficial in terms of the effort/improvement balance.
- Is the engine, after the improvements, now competitive with similar KGC engines? I understand that the scope of the paper is to check that the optimizations work with respect to previous versions, but it would also be beneficial to check how it now performs with respect to state-of-the-art engines.

Minor:
- Figure 1 could make better use of space and use a larger font; adding numbers to the steps could also help the reader follow the process in the text.
- Section 3.2, composeIterationQuery: what happens with tabular data? It may be worth explaining this as well, rather than only focusing on hierarchical data.
- Tables 1 and 2 could be box or violin plots; that would make visual comparison easier, although the tables give all the information in any case. The names of the engines could be shortened to just the version (ShExML-v0.3.2.jar --> v0.3.2), as the rest is the same in all of them.
- Throughout the entire paper, sentences are in general too long, and there is a general lack of commas.