Review Comment:
*** Overall Impression ***
This work presents a publicly accessible tool for SPARQL-based federation, which I think is a hot topic at the moment and of interest to the readers of this journal. The tool is hosted on GitHub, and adequate documentation is provided.
I recommend a major revision, though, because I think that the work is not ready for publication in this journal, and that several issues need to be addressed first. These issues can be summarized in the following points:
1) The setting of the work is unclear.
First and foremost, as a researcher whose main interest is OBDA, I had a hard time understanding where the "OBDA" in this work actually is. In the title, the authors claim to propose a "Hybrid OBDA System". What does "Hybrid" mean in the context of OBDA? This is never explained, and in fact the word "hybrid" occurs only in the title of the work and never again.
I also found the following sentence from the abstract quite confusing:
"Ontology-Based Big Data Access solutions by design use a fixed model. e.g., TABULAR, as the only Virtual Data Model - a uniform schema that is built on-the-fly to load, transform, and join relevant data".
This sentence is misleading. OBDA is also known in the literature as "Virtual Knowledge Graphs (VKGs)". The name itself suggests that the virtual data model is a graph, not a "TABULAR" one. Also, the word "Virtual" in VKG indicates that such a graph is never materialized: users' queries over the KG are reformulated on-the-fly into (traditionally, SQL) queries that are evaluated directly against the data source (which might also be a federation of multiple sources [Gu et al. 2022]). In classical OBDA, there is no loading or transformation of data! Even in the original paper by Poggi et al., cited by the authors, the assertions box (ABox) is always specified to be virtual and never loaded or materialized.
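To make the distinction concrete, here is a minimal Python sketch of the classical OBDA/VKG behaviour (the table, mapping, and query are purely illustrative and not taken from the manuscript or from any specific engine): the BGP is unfolded through the mapping into SQL and evaluated directly at the source, so the ABox never exists as loaded data.

```python
# Toy illustration of classical OBDA/VKG query answering: the SPARQL query is
# rewritten into SQL and evaluated directly by the source; the ABox is never
# loaded or materialized. All names below are hypothetical.
import sqlite3

# Hypothetical relational source with a 'product' table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE product (id INTEGER, label TEXT, producer_id INTEGER)")
src.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "Widget", 10), (2, "Gadget", 11)])

# An R2RML-style mapping, sketched as a plain dict: rows of 'product' are viewed
# as instances of :Product, with :label taken from the 'label' column.
mapping = {":Product": {"table": "product", "subject": "id", ":label": "label"}}

# SPARQL query over the *virtual* knowledge graph.
sparql = "SELECT ?l WHERE { ?p a :Product ; :label ?l }"

# A real OBDA engine (e.g., Ontop) performs this unfolding automatically; here
# the translation of this single BGP is written by hand.
m = mapping[":Product"]
sql = f"SELECT {m[':label']} AS l FROM {m['table']}"

# The SQL is pushed to the source; only the result set crosses the wire.
print(src.execute(sql).fetchall())  # [('Widget',), ('Gadget',)]
```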
The only relation I see with a classical OBDA setting is that OPTIMA is also based on RML/R2RML mappings. However, the similarities stop there:
- From what I understood, OPTIMA's query evaluation relies on an on-the-fly materialization of the data contained in the sources, rather than on query reformulation and execution against the sources themselves, which is instead the characteristic approach of OBDA systems;
- OPTIMA does not seem to support any kind of reasoning with respect to the ontology, whereas query answering through query rewriting is probably the most researched and distinguishing feature of OBDA.
I would brand OPTIMA as a SPARQL-based federation system able to provide transparent access to heterogeneous data sources, notably including graph databases, and then mention in the text that sources are connected to wrappers through RML/R2RML specifications, following OBDA principles. This requires a shift in the framing of the work, since the word "federation" currently appears only once in the manuscript.
2) The contribution is unclear.
Given the premises in 1), what does OPTIMA actually add to the existing literature on SPARQL federation? I think the main novelty, and maybe the most interesting result, is that OPTIMA is able to choose the best virtual schema against which to evaluate queries. Apart from this aspect, I do not see other novelties. Actually, an important aspect of state-of-the-art SPARQL federation systems is their ability to push execution down to the sources whenever possible, so as to avoid expensive data transfers, as well as their sophisticated mechanisms for source selection. Are any of these techniques implemented in OPTIMA?
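To clarify what I mean by push-down, consider the following PySpark sketch (the connection details, table names, and schema are made up for illustration; this is not the authors' code): loading whole tables into dataframes and joining them in Spark transfers far more data than shipping the selective part of the query to the source when the involved tables live there.

```python
# Illustrative PySpark sketch of query push-down; all connection details and
# table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

jdbc = {"url": "jdbc:mysql://example.org:3306/bsbm",
        "user": "reader", "password": "secret",
        "driver": "com.mysql.cj.jdbc.Driver"}

# Naive wrapper: load both tables and join them inside Spark. (Spark pushes
# simple column filters down to JDBC sources, but not the join itself.)
product = spark.read.format("jdbc").options(**jdbc).option("dbtable", "product").load()
producer = spark.read.format("jdbc").options(**jdbc).option("dbtable", "producer").load()
naive = (product.join(producer, product.producer_id == producer.id)
                .filter(product.price < 100))

# Push-down: when both tables live in the same source, the join and the filter
# can be evaluated there, and only the (much smaller) result is transferred.
pushed = (spark.read.format("jdbc").options(**jdbc)
          .option("query",
                  "SELECT p.label, r.name FROM product p "
                  "JOIN producer r ON p.producer_id = r.id "
                  "WHERE p.price < 100")
          .load())
```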
3) The evaluation lacks details, is sometimes unconvincing, and some results seem to be misinterpreted.
I find the evaluation insufficient to demonstrate the effectiveness of the approach. As I discuss in depth in the "Detailed Comments" section below, the good performance of OPTIMA's ML model is likely due to overfitting, and it is not clear whether it can generalize to queries having an unseen structure. Also, I have several doubts that the approach of loading data at query time into Spark dataframes can be effective when dealing with huge quantities of data. With respect to this, I expected to see a scalability analysis with different data sizes; however, the authors do not even report the data size for the single test they perform.
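Regarding generalization, a simple way to test it would be a template-wise split, along the lines of the following sketch (the feature representation and variable names are hypothetical, not the authors' pipeline): all instantiations of a given query template end up entirely in the training set or entirely in the test set, so the model is always evaluated on structures it has never seen.

```python
# Illustrative sketch of a template-wise train/test split for checking
# generalization to unseen query structures. Names are hypothetical.
from sklearn.model_selection import GroupShuffleSplit

# X: query feature vectors, y: best virtual model (e.g., 0 = TABULAR, 1 = GRAPH),
# templates: the BSBM template each query was instantiated from.
def template_wise_split(X, y, templates, test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=templates))
    return train_idx, test_idx
```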
4) The work looks unpolished, and it gives the reader a general sense of sloppiness.
This point will become self-evident after reading the Detailed Comments section below, where I list a number of issues.
5) Resources relative to the tool evaluation are missing.
Although the system is made available through GitHub, I could not find anywhere the material the authors used for their evaluation. This would have helped clarify several of the doubts I had when reading the evaluation section.
*** Detailed Comments ***
- Page 1, Line 24: please clearly indicate what is "hard to predict".
- Page 1, Line 47-48: what does it mean to "join two models"? Please clarify, or rephrase.
- Page 1, Lines 46-48: this sentence should be more precise. In its current form, it seems to imply that all "solutions dedicated to Big Data" proceed in an ETL fashion, whereas in-memory processing is usually avoided and only adopted whenever a query cannot be handled at the level of a single source.
- Page 3, Line 23: general SPARQL queries do not have a single BGP. Indicate clearly what fragment of SPARQL is supported by OPTIMA.
- Page 3, Line 24: "Those sub-BGPs sharing the same subject are called a star-shaped query" -> "... are called star-shaped queries".
- Page 3, Line 25: introduce somewhere what this "mappings file" is, as well as RML/R2RML.
- Page 3, Line 26: this sentence seems to suggest that only direct mappings are allowed: is this really the case? Please observe that full-fledged OBDA systems are able to handle arbitrary R2RML mappings (provided that they are mapping ABox assertions).
- Page 3, Line 31 to end of paragraph: doesn't this "conversion" involve a materialization of some sort? Isn't this a limitation of your system with respect to other systems that also try to minimize data movement?
- Page 3, Line 34: cite one or two examples of query engines "implementing wrappers".
- Page 4, Figures 3b and 4a: make the figures more explicit, by indicating that the products come from different sources; Figure 4b is too small, and essentially unreadable on paper. Remove the redundant "(a)" and "(b)" labels at the beginning of each caption.
- Page 5: Figures 6a and 6b are too small! Delete the redundant "(a)" and "(b)" labels at the beginning of the caption. Explain the content of these figures!
- Page 5, Lines 25 and 26: the content is very confusing. How do the two "virtual data models, GRAPH and TABULAR" mentioned on Line 25 differ from the two "data models, GRAPH and TABULAR" mentioned on Line 26?
- Page 5, Lines 31-34: "Figure Figure" -> "Figure". Explain the content of the figures! For instance, what do the "blue connections" in Figure 4a, Label 1 represent? How were they computed? That is, how do you decide which comments refer to which product, in absence of other information? Why are some nodes "grayed out" after the merge, and what does this denote? Add citations for "multi-join algorithm" and "incremental join algorithm". Finally, I fail to understand how "Figure 1, Label 2" should help clarifying that "the joining is through connections between star-shaped queries".
- Page 5, Line 43: "Sequerall" -> Squerall.
- Page 5, Lines 43-47: the provided details are not enough to understand the significance of your evaluation. Where do these 5150 queries come from, given that the BSBM benchmark is made up of 12 (templates of) SPARQL queries? What is the relation between these 5150 queries and the 20 queries described in Table 1? They do not appear to be different "instantiations" of the 20 queries, because 5150 is not a multiple of 20. Also, if these 5150 are just variations of the same subset of (20?) queries, isn't your ML model just overfitting? What is the size of your data? How was the data distributed across the different sources? For instance, was table 'Product' fully loaded within one source (which one?), or spread across several ones? How many times was each query executed? Were there warm-up runs? Etc.
- Page 5, Line 50: "...equal to the fixed one?" This sentence is broken. Rephrase.
- Page 6, Table 1: explain the content of this table, and maybe also include here the data source associated with each entity.
- Page 6, Line 12: "compared to the state-of-the-art" -> state-of-the-art in what, precisely?
- Page 6, Line 37: What is the "star" in BSBM*? Spell out Berlin SPARQL Benchmark, and cite it! Also, in the sentence I read "how explained above"; however, it was actually never explained how the data was loaded (which table, or portion of it, went to which source(s)?).
- Page 6, Lines 39-40: where does this query Q21 come from? It does not appear in Table 1! Is it also part of the "5150" queries? Also, I would not conclude from a single execution of a single query that "OPTIMA is able to support and join large data coming from different datasets". The query (shown in Table 3a) does not even look complicated: it is a single-BGP query joining a product with its producer!
- Page 6, Line 42: I might be wrong, but "excels" can be used as a transitive verb only in a limited number of cases. I would use a more standard "outperforms", or something like that.
- Page 6, Line 43: "The time difference ranges from 0 to 80000 milliseconds" -> Actually, in Table 2, I see a difference from 6 to 6431 milliseconds, which is a whole order of magnitude lower than 80000. Could you clarify what you mean?
- Page 6, Line 47: specify what is the virtual data model adopted by Squerall.
- Page 6, Line 42: "As can be observed, OPTIMA excels Squerall for queries that involve multiple joins". I fail to see this, since the time difference between the two systems seems to be independent of the number of joins. For instance, queries Q2 and Q12 have respectively one and zero joins, yet the time difference is quite large. Similarly, query Q4 is a join over all of the tables, yet the time difference is non-existent. I could make similar comments with respect to the number of projections mentioned in Lines 45 and 46.
- Page 6, Lines 49-50: "table 3b" -> Table 3b. Also, Table 3a is totally unrelated to 3b. Why are they put under the same "Table 3" label?
- Page 7, Table 2: use something (e.g., the bold font?) to denote the "winning" system.
- Page 7, Line 7: this sentence seems broken.
- Page 7, Line 10: "table 5a" -> Table 5a.
- Page 7, Lines 10-11: how were the averages for GRAPH or TABULAR computed? Were they computed over the set of 1030 queries, over the 5150 queries, or over the sub-categories found by the classifier?
- Page 7, Lines 12-13: I cannot parse this sentence.
- Page 7, Lines 17-20: the information provided seems quite incomplete. Do the values refer to peaks or averages? Make this explicit!
- Page 7, Tables 3a, 3b, and 3c: the text in Table 3a is too small; Remove the redundant "(a), (b), (c)" labels from the beginning of the captions; align captions with tables!
- Page 7, Line 39: define the abbreviation "LSTM"; explain the significance of comparing your model against LSTM, also because the sentence from Line 45 to Line 47 seems to suggest that LSTM is indeed not a good model for your task (then, why compare against it?).
- Page 8, Table 4: the times indicated in this table seem to be inconsistent with those in Table 2. In particular, this table reports an execution time over GRAPH of only 1243 ms for Q4, whereas in Table 2 OPTIMA and Squerall have comparable performance for this query. Could you explain? A similar issue happens for Q5. For Q6, instead, the least time is the one of GRAPH, which is higher than the one indicated for OPTIMA in Table 2. Are these inconsistencies due to the fact that you only ran against 1030 queries? If that is the case, why not run against the full query set of 5150 queries, since comparing GRAPH to TABULAR has nothing to do with testing the ML model?
- Page 8, Lines 15-18: As with Table 2, in Table 4 the claimed execution-time pattern for queries over GRAPH and TABULAR in relation to joins/projections is not so clear. There are unexplained "exceptions", for instance Q5. Please also clarify that the boundaries in ms indicated in the text do not refer to the content of Table 4.
- Page 8, Lines 29-30: is your guess based on filters actually grounded? Have you tried removing the filters to see what happens? This does not seem like a difficult check to do, so I do not see a reason why it should be left to guessing.
- Page 8, Line 31: given that no statistics about the size of the data have been provided, this sentence sounds very off.
- Page 8, Line 34: did you mean "non-selective"?
- Page 9, first and second item of the list: CSV and MySQL have the same relational structure. So, if the reason for the loading performance of MySQL is really the relational structure alone, I do not see why this should not hold for CSV as well. Provide a better explanation for what you observed.
- Page 9, Line 24: typically, OBDA approaches do not consider federation, whereas your sentence seems to suggest that their focus is on source federation.
- Page 9, Line 28: missing ')' parenthesis.
- Page 9, Line 30: this sentence is slightly imprecise. Ontop, for instance, can query NoSQL sources through Teiid or Dremio (https://ontop-vkg.org/tutorial/federation/).
- Page 9, Lines 32-34: it should be mentioned that Ontop was the OBDA engine underlying Optique. Access to cloud was provided through Exareme (http://madgik.github.io/exareme/).
- Page 9, Lines 43-45: cost estimation of SPARQL queries is totally unrelated to ML. For instance, works like [Lanti et al. 2017] or [Bilidas et al. 2021] were not using ML techniques at all, but standard cost models built on statistics collected over the data source (see the sketch below for the kind of estimate I mean).
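To be explicit about what such a statistics-based (non-ML) cost model looks like, here is a textbook-style sketch (a deliberate simplification; the cited works use richer statistics, and none of the numbers below come from the manuscript):

```python
# Minimal sketch of a statistics-based, non-ML cardinality estimate in the
# spirit of classical System R cost models: the expected size of joining R and S
# on attribute a is |R| * |S| / max(V(R, a), V(S, a)), where V(., a) is the
# number of distinct values of a. Textbook simplification, purely illustrative.
def estimated_join_size(card_r, card_s, distinct_r_a, distinct_s_a):
    return card_r * card_s / max(distinct_r_a, distinct_s_a)

# Example with made-up statistics: 10,000 products and 500 producers joined on
# the producer id (500 distinct ids on both sides) -> about 10,000 result tuples.
print(estimated_join_size(10_000, 500, 500, 500))
```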
Comments
"Submission type": Tool/System Report, not full paper
This paper was submitted for consideration in the Special Issue "Tools & Systems", and the "Submission type" should be Tool/System Report.
Corrected
Thanks, this has been corrected.
Pascal.