OPTIMA: A Hybrid OBDA System for Efficiently Querying Large Heterogeneous Data

Tracking #: 3249-4463

Authors: 
Chahrazed B. BACHIR-BELMEHDI
Dr. Abderrahmane Khiat
Nabil Keskes

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
The current decade is witnessing a remarkable evolution in terms of big data virtualization: data is queried on-the-fly against the original data sources without any prior data materialization. Ontology-Based Big Data Access solutions by design use a fixed model, e.g., TABULAR, as the only Virtual Data Model - a uniform schema that is built on-the-fly to load, transform, and join relevant data. Other data models, such as GRAPH or DOCUMENT, are however more flexible and thus can be more suitable for some common types of queries, such as join or nested queries. Those queries are, in many cases, hard to predict because they depend on many criteria, such as the query plan, data model, data size, and operations, e.g., join and filter. To address the problem of selecting the optimal virtual data model for various queries on large datasets, we develop OPTIMA, a framework that (1) builds on the principle of ontology-based data access to enable the querying, aggregating, and joining of large heterogeneous data in a distributed manner using a single query language, SPARQL, and (2) uses a deep learning method to predict the optimal virtual data model from features extracted from the SPARQL query. OPTIMA currently leverages the state-of-the-art Big Data technology Spark, implements two virtual data models, GRAPH and TABULAR, and supports five data sources out of the box: Neo4j, MongoDB, MySQL, Cassandra, and CSV. Extensive experiments show that OPTIMA returns the optimal virtual model with an accuracy of 0.831, reducing query execution time by over 40% when the tabular model is selected and by over 30% when the graph model is selected.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Davide Lanti submitted on 24/Oct/2022
Suggestion:
Major Revision
Review Comment:

*** Overall Impression ***

This work presents a publicly accessible tool for SPARQL-based federation, which I think is a hot topic at the moment and of interest to the readers of this journal. The tool is hosted on GitHub, and adequate documentation is provided.

I give a major revision, though, because I think that the work is not ready for publication in this journal, and that several issues need to be addressed first. These issues can be summarized in the following points:

1) The setting of the work is unclear.

First and foremost, as a researcher whose main interest is OBDA, I had a hard time understanding where the "OBDA" in this work actually is. In the title, the authors claim to propose a "Hybrid OBDA System". What does "Hybrid" mean in the context of OBDA? This is never explained; in fact, the word "hybrid" occurs only in the title of the work and never again.

I also found the following sentence from the abstract quite confusing:

"Ontology-Based Big Data Access solutions by design use a fixed model. e.g., TABULAR, as the only Virtual Data Model - a uniform schema that is built on-the-fly to load, transform, and join relevant data".

This sentence is misleading. OBDA is also known in the literature as "Virtual Knowledge Graphs (VKGs)". The name itself suggests that the virtual data model is a graph, and not a "TABULAR". Also, the word "Virtual" in VKG indicates that such a graph is never materialized: users' queries over the KG are reformulated on-the-fly into (traditionally, SQL) queries that are evaluated directly against the data source (which might also be a federation of multiple sources [Gu et al. 2022]). In classical OBDA, there is no loading or transformation of data! Even in the original paper by Poggi et al., cited by the authors, the assertions box (ABox) is always specified to be virtual and never loaded or materialized.
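To make the contrast concrete, here is a toy sketch of reformulation-based query answering under a hypothetical direct mapping (illustrative only; this is neither OPTIMA's code nor any real engine's):

    # Toy illustration (hypothetical direct mapping and schema): in
    # classical OBDA/VKG, a SPARQL pattern is rewritten into SQL that
    # the source itself evaluates; nothing is loaded or materialized.
    MAPPING = {
        # property -> (table, subject key column, object column)
        ":hasProducer": ("product", "id", "producer_id"),
        ":label": ("product", "id", "label"),
    }

    def reformulate(subj_var, prop, obj_var):
        """Rewrite one SPARQL triple pattern into SQL under the mapping."""
        table, key, col = MAPPING[prop]
        return (f"SELECT {key} AS {subj_var.lstrip('?')}, "
                f"{col} AS {obj_var.lstrip('?')} FROM {table}")

    # SPARQL: SELECT ?x ?p WHERE { ?x :hasProducer ?p }
    print(reformulate("?x", ":hasProducer", "?p"))
    # -> SELECT id AS x, producer_id AS p FROM product

OPTIMA's on-the-fly loading into Spark is a different mode of operation, which is fine, but it should not be presented as classical OBDA.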

The only relation I see with a classical OBDA setting is that OPTIMA, too, is based on RML/R2RML mappings. However, the similarities stop there:

- From what I understood, OPTIMA's mode of query evaluation relies on an on-the-fly materialization of the data contained in the data sources, rather than on query reformulation and execution against the data sources themselves, which is instead a characteristic of OBDA systems;
- OPTIMA does not seem to support any kind of reasoning with respect to the ontology, whereas query answering through query rewriting is probably the most researched and distinguishing feature of OBDA.

I would brand OPTIMA as a SPARQL-based federation system able to provide transparent access to heterogeneous data sources, notably including graph databases, and then mention in the text that sources are connected to wrappers through RML/R2RML specifications, following OBDA principles. This requires a shift in the setting of the work, since the word "federation" currently appears only once in the manuscript.

2) The contribution is unclear.

Given the premises in 1), what is OPTIMA actually adding to the existing literature on SPARQL federation? I think the main novelty, and maybe the most interesting result, is the fact that OPTIMA is able to choose the best virtual schema against which to evaluate queries. Apart from this aspect, I do not see other novelties. Actually, an important aspect of state-of-the-art SPARQL federation systems is their ability to push down the execution to the sources whenever possible, so as to avoid expensive data transfers, as well as their sophisticated mechanisms for source selection. Are any of these techniques implemented in OPTIMA?

3) The evaluation lacks details, is sometimes unconvincing, and some results seem to be misinterpreted.

I find the evaluation insufficient to demonstrate the effectiveness of the approach. As I discuss in depth in the "Detailed Comments" section below, the goodness of OPTIMA's ML model is likely due to overfitting, and it is not clear whether it can generalize to queries having an unseen structure. Also, I have several doubts about whether the approach of loading data at query time into Spark dataframes can be effective when dealing with huge quantities of data. With respect to this, I expected to see a scalability analysis with different sizes of data; however, the authors do not even report the data size for the single test they perform.

4) The work looks unpolished, and it gives the reader a general sense of sloppiness.

This point will become self-evident after reading the Detailed Comments section below, where I list a number of issues.

5) Resources relative to the tool evaluation are missing.

Although the system is made available through GitHub, I could not find anywhere the material the authors used for their evaluation. This would have helped clarify several of the doubts I had when reading the evaluation section.

*** Detailed Comments ***

- Page 1, Line 24: please clearly indicate what is "hard to predict".
- Page 1, Line 47-48: what does it mean to "join two models"? Please clarify, or rephrase.
- Page 1, Lines 46-48: this sentence should be more precise. In its current form, it seems to imply that all "solutions dedicated to Big Data" proceed in an ETL fashion, whereas in-memory processing is usually avoided and only adopted whenever a query cannot be handled at the level of a single source.
- Page 3, Line 23: general SPARQL queries do not have a single BGP. Indicate clearly what fragment of SPARQL is supported by OPTIMA.
- Page 3, Line 24: "Those sub-BGPs sharing the same subject are called a star-shaped query" -> "... are called star-shaped queries" (a toy illustration of this decomposition is sketched at the end of this list).
- Page 3, Line 25: introduce somewhere what this "mappings file" is, as well as RML/R2RML.
- Page 3, Line 26: this sentence seems to suggest that only direct mappings are allowed: is this really the case? Please observe that full-fledged OBDA systems are able to handle arbitrary R2RML mappings (provided that they are mapping ABox assertions).
- Page 3, Line 31 to end of paragraph: does not this "conversion" involve a materialization of some sort? Isn't this a limitation of your system with respect to other systems that also try to minimize data movement?
- Page 3, Line 34: cite one or two examples of query engines "implementing wrappers".
- Page 4, Figures 3b and 4a: make the figures more explicit, by indicating that the products come from different sources; Figure 4b is too small, and essentially unreadable on paper. Remove the redundant "(a)" and "(b)" labels at the beginning of each caption.
- Page 5: Figures 6a and 6b are too small! Delete the redundant "(a)" and "(b)" labels at the beginning of the caption. Explain the content of these figures!
- Page 5, Lines 25 and 26: the content is very confusing. How do the two "virtual data models, GRAPH and TABULAR" mentioned on Line 25 differ from the two "data models, GRAPH and TABULAR" mentioned on Line 26?
- Page 5, Lines 31-34: "Figure Figure" -> "Figure". Explain the content of the figures! For instance, what do the "blue connections" in Figure 4a, Label 1 represent? How were they computed? That is, how do you decide which comments refer to which product, in the absence of other information? Why are some nodes "grayed out" after the merge, and what does this denote? Add citations for "multi-join algorithm" and "incremental join algorithm". Finally, I fail to understand how "Figure 1, Label 2" should help clarify that "the joining is through connections between star-shaped queries".
- Page 5, Line 43: "Sequerall" -> Squerall.
- Page 5, Lines 43-47: the provided details are not enough to understand the significance of your evaluation. Where do these 5150 queries come from, given that the BSBM benchmark is made up of 12 (templates of) SPARQL queries? What is the relation between these 5150 queries and the 20 queries described in Table 1? They do not appear to be different "instantiations" of the 20 queries, because 5150 is not a multiple of 20. Also, if these 5150 are just variations of the same subset of (20?) queries, isn't your ML model just overfitting? What is the size of your data? How was the data distributed across the different sources? For instance, was table 'Product' fully loaded within one source (which one?), or spread across several ones? How many times was each query executed? Were there warm-up runs? Etc.
- Page 5, Line 50: "...equal to the fixed one?" This sentence is broken. Rephrase.
- Page 6, Table 1: explain the content of this table, and maybe also include here the data source associated with each entity.
- Page 6, Line 12: "compared to the state-of-the-art" -> state-of-the-art in what, precisely?
- Page 6, Line 37: what is the "star" in BSBM*? Spell out Berlin SPARQL Benchmark, and cite it! Also, in this sentence I read "how explained above"; however, it was actually never explained how the data was loaded (which table, or portion of it, went to which source(s)?).
- Page 6, Lines 39-40: where does this query Q21 come from? It does not appear in Table 1! Is it also part of the "5150" queries? Also, I would not conclude from a single execution of a single query that "OPTIMA is able to support and join large data coming from different datasets". The query (shown in Table 3a) does not even look complicated: it is a single BGP query joining a product with its producer!
- Page 6, Line 42: I might be wrong, but "excels" can be used as a transitive verb only in a limited number of cases. I would use a more standard "outperforms", or something like that.
- Page 6, Line 43: "The time difference ranges from 0 to 80000 milliseconds" -> Actually, in Table 2, I see a difference from 6 to 6431 milliseconds, which is a whole order of magnitude lower than 80000. Could you clarify what you mean?
- Page 6, Line 47: specify what is the virtual data model adopted by Squerall.
- Page 6, Line 42: "As can be observed, OPTIMA excels Squerall for queries that involve multiple joins". I fail to see this, since the time difference between the two systems seems to be independent of the number of joins. For instance, queries Q2 and Q12 have respectively one and zero joins, yet the time difference is quite large. Similarly, query Q4 is a join over all of the tables, yet the time difference is non-existent. I could make similar comments with respect to the number of projections mentioned in Lines 45 and 46.
- Page 6, Lines 49-50: "table 3b" -> Table 3b. Also, Table 3a is totally unrelated to 3b. Why are they put under the same "Table 3" label?
- Page 7, Table 2: use something (e.g., the bold font?) to denote the "winning" system.
- Page 7, Line 7: this sentence seems broken.
- Page 7, Line 10: "table 5a" -> Table 5a.
- Page 7, Lines 10-11: how were the averages for GRAPH or TABULAR computed? Were they computed over the set of 1030 queries, over the 5150 queries, or over the sub-categories found by the classifier?
- Page 7, Lines 12-13: I cannot parse this sentence.
- Page 7, Lines 17-20: the information provided seems quite incomplete. Do the values refer to peaks or averages? Make this explicit!
- Page 7, Tables 3a, 3b, and 3c: the text in Table 3a is too small; Remove the redundant "(a), (b), (c)" labels from the beginning of the captions; align captions with tables!
- Page 7, Line 39: define the abbreviation "LSTM"; explain the significance of comparing your model against LSTM, also because the sentence from Line 45 to Line 47 seems to suggest that LSTM is indeed not a good model for your task (then, why compare against it?).
- Page 8, Table 4: the times indicated in this table seem to be inconsistent with those in Table 2. In particular, this table reports an execution time over GRAPH of only 1243 ms for Q4, whereas in Table 2 OPTIMA and Squerall have comparable performance for this query. Could you explain? A similar issue happens for Q5. For Q6, instead, the least time is the one of GRAPH, which is higher than the one indicated for OPTIMA in Table 2. Are these inconsistencies due to the fact that you only ran against 1030 queries? If that is the case, why not run against the full query set of 5150 queries, since comparing GRAPH to TABULAR has nothing to do with testing the ML model?
- Page 8, Lines 15-18: as with Table 2, in Table 4 the claimed execution-time pattern for queries over GRAPH and TABULAR in relation to joins/projections is not so clear. There are unexplained "exceptions", for instance Q5. Clarify that the boundaries in ms indicated in the text do not refer to the content of Table 4.
- Page 8, Lines 29-30: is your guess based on filters actually grounded? Have you tried removing the filters to see what happens? This does not seem like a difficult check to do, so I do not see a reason why it should be left to guessing.
- Page 8, Line 31: given that no statistics about the size of the data have been provided, this sentence sounds very off.
- Page 8, Line 34: did you mean "non-selective"?
- Page 9, first and second item of the list: CSV and MySQL have the same relational structure. So, if the reason for the loading performance of MySQL is really the relational structure alone, I do not see why this should not hold for CSV as well. Provide a better explanation for what you observed.
- Page 9, Line 24: typically, OBDA approaches do not consider federation, whereas your sentence seems to suggest that their focus is on source federation.
- Page 9, Line 28: missing ')' parenthesis.
- Page 9, Line 30: this sentence is slightly imprecise. Ontop, for instance, can query NoSQL sources through Teiid or Dremio (https://ontop-vkg.org/tutorial/federation/).
- Page 9, Lines 32-34: it should be mentioned that Ontop was the OBDA engine underlying Optique. Access to cloud was provided through Exareme (http://madgik.github.io/exareme/).
- Page 9, Lines 43-45: cost estimation of SPARQL queries is totally unrelated to ML. For instance, works like [Lanti et al. 2017] or [Bilidas et al. 2021] were not using ML techniques at all, but standard cost models built on statistics collected over the data source.
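As a side note to the comment on Page 3, Line 24 above, here is a toy sketch of the star-shaped decomposition (hypothetical BGP; Python is used only for illustration, this is not OPTIMA's code):

    from collections import defaultdict

    # A BGP whose triple patterns, grouped by subject, form two stars.
    BGP = [
        ("?product", ":label", "?label"),
        ("?product", ":hasProducer", "?producer"),
        ("?producer", ":country", "?country"),
    ]

    stars = defaultdict(list)
    for s, p, o in BGP:
        stars[s].append((p, o))  # one star per shared subject

    for subject, branches in stars.items():
        print(subject, "->", branches)
    # ?product -> [(':label', '?label'), (':hasProducer', '?producer')]
    # ?producer -> [(':country', '?country')]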

Review #2
Anonymous submitted on 29/Jan/2023
Suggestion:
Major Revision
Review Comment:

The article is about OPTIMA, a query answering system over heterogeneous data sources. It is dedicated to big data and follows a data materialization approach, where the data required for answering a query are transformed and loaded into a system that evaluates the query. OPTIMA tackles the following question: which kind of system, TABULAR or GRAPH, best serves the evaluation of a given query on given data sources? OPTIMA uses a deep learning method to make this prediction.

* Remarks

** General
- The article is about Ontology-Based Data Access, but it seems that it is not about ontology reasoning, so it is in fact more focused on data integration for the Semantic Web.
- The article presents a query answering system for heterogeneous data integration, which requires presenting the mapping language from the data sources to the integration data model, in this case the RDF graph. Is it something like RML? (A small example of what such a mapping might look like is sketched after this list.) Section 2.3 presents the data wrappers, which have a static way of transforming the data from the data source into the TABULAR or GRAPH data model. Does this mean that there is no mapping language at all?
- The data model prediction is based on training over one data source instance. What happens if the data sources are updated? What is the amount of work that needs to be performed in order to keep the predictions up to date?
- The question raised by the article can be phrased as follows: "what is the best data model to materialize heterogeneous data sources and then to efficiently evaluate a query on them?". It seems to me that it is a quite theoretical question, where the schema used to materialize the data plays a big role whatever the chosen data model. So, the choice of the data wrappers has to be motivated, since they determine the schema of the materialization, and the focus of the question on the data model has to be better justified. In the end, it seems that the problem is more about choosing the better system to store the materialized data, in relation to the kind of query each system is optimized for, than about choosing the better data model. So I think the article should not talk about the best data model, because that is not what it is about, but about the best query execution system: either a system dedicated to TABULAR data or one dedicated to GRAPH data.
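To make the question about the mapping language concrete, a minimal R2RML-style mapping could look as follows (hypothetical example, not taken from the article):

    from rdflib import Graph

    MAPPING_TTL = """
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.org/> .

    ex:ProductMap a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "product" ] ;
        rr:subjectMap [ rr:template "http://example.org/product/{id}" ;
                        rr:class ex:Product ] ;
        rr:predicateObjectMap [ rr:predicate ex:label ;
                                rr:objectMap [ rr:column "label" ] ] .
    """

    # The mapping itself is just RDF; parsing it checks well-formedness.
    g = Graph().parse(data=MAPPING_TTL, format="turtle")
    print(len(g), "mapping triples")

If the wrappers hard-code the transformation instead, the article should say so explicitly.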

** Experiment
- Experiment 1 uses BSBM, but there is no information about where the queries come from or about the size of the dataset (or the scale factor). The number of answers for each query should be added to Table 1. Table 1 contains only a strict subset of the BSBM relational schema tables; why?
- The article presents a system for big data, but does not give any information about the size of the databases used in the experiments. This is very strange. The article should also discuss the scalability of the system.
- What is the LSTM model? The article should contain a reference for it.
- The dataset should be presented at the beginning of Section 3 and not in Section 3.1, since it is used for both experiments 1 and 2.
- The times for OPTIMA in Table 2 and the best times in Table 4 are different (see Q5). Why is this the case?
- What does the query execution time cover? Does it include the time used by the wrapper and the loading time?

** Others

- In the section about the experiments, replace milliseconds with seconds where possible (e.g., 3000 ms -> 3 s). It is easier to read and saves space.
- On page 6 there is "BSBM*"; is the star a missing footnote?
- In Table 1, Q21 is missing.
- Page 9: SAPRK -> SPARK.
- You could have a look at the Obi-Wan OBDA system, which is one of the few OBDA systems that support NoSQL data sources. It could also be interesting to mention some work on polystores (e.g., ESTOCADA), where similar problems are studied.

Review #3
Anonymous submitted on 01/Feb/2023
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. Please also assess the data file provided by the authors under "Long-term stable URL for resources". In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

This paper presents Optima, a tool for efficiently querying large heterogeneous data. The main idea of this tool is to use machine learning to predict a virtual data model that is optimal for a given user SPARQL query. Two virtual data models are considered, GRAPH and TABULAR. The whole pipeline can be described as follows.

(1) The input SPARQL query is analysed to detect relevant entities;

(2) The virtual model that is optimal for query evaluation is predicted;

(3) The original data of each relevant entity is converted (wrapped) to the predicted format;

(4) All virtual views are joined by Optima into a single global virtual view;

(5) Depending on the virtual data model, Optima makes calls to the existing query processing tools GraphX and Apache Spark to answer the query.
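For the record, my reading of this pipeline as a runnable toy (all names, the data, and the threshold in step (2) are my own hypothetical stand-ins; the paper does not spell out Optima's interfaces):

    def relevant_entities(bgp):
        """(1) Entities = distinct subjects of the query's triple patterns."""
        return sorted({s for s, _, _ in bgp})

    def predict_virtual_model(bgp):
        """(2) Stand-in for the learned predictor: GRAPH for join-heavy queries."""
        return "GRAPH" if len(relevant_entities(bgp)) > 1 else "TABULAR"

    def wrap(entity, rows, model):
        """(3) Convert one entity's source data to the predicted model."""
        return {"entity": entity, "model": model, "data": rows}

    def join_views(views):
        """(4) Merge the per-entity views into one global virtual view."""
        return [row for v in views for row in v["data"]]

    bgp = [("?product", ":label", "?l"),
           ("?product", ":hasProducer", "?producer"),
           ("?producer", ":country", "?c")]
    sources = {"?product": [("p1", "Laptop", "pr1")],
               "?producer": [("pr1", "DE")]}

    model = predict_virtual_model(bgp)
    views = [wrap(e, sources[e], model) for e in relevant_entities(bgp)]
    print(model, join_views(views))  # (5) handed to GraphX or Spark SQL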

I find the idea of using machine learning to predict the optimal data model for the virtual view very interesting. However, this paper completely lacks the technical details needed to understand what exactly the authors are doing. Even though the paper has been submitted to the tools and systems track, it is essentially the first paper describing the approach implemented in Optima, so I would expect to see such details. In particular,

-- everything is explained at a very high level (for that reason it actually also reads very easily); there is not a single formal definition of any term.

-- it is not clear what the training process of the prediction model is, what the training data is, what the inputs are, what the labels are, and how they have been obtained. Is it a classification problem, GRAPH or TABULAR? What is the architecture of the predictive model? What is the estimation layer? Also, it is not clear whether the virtual model is predicted for each of the sub-plans/sub-queries or for the whole query. In the former case, what happens if the predicted model is different for different sub-queries? Section 2.1 and Figure 2 are not helpful in understanding these details. At some point later there is a reference to an LSTM model, and again it is not clear how it was obtained. (A sketch of the kind of classifier the paper should describe explicitly is given after this list.)

-- there is no formal guarantee that all transformations are sound and complete, and that the whole process is guaranteed to return correct query answers. There are many things that could go wrong during data transformation or during distributed query processing. How do we know that you do not miss some answers? Or that you do not return spurious answers? The fact that the answers are the same as for the baseline does not sound like a formal proof to me.
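To illustrate the kind of detail I am asking for: if the task is indeed binary classification over hand-crafted query features, the setup could be as simple as the following (my own hypothetical sketch, not Optima's actual model):

    from sklearn.ensemble import RandomForestClassifier

    # Features per query: [number of joins, projections, filters].
    # Labels: which virtual model answered the query fastest.
    X_train = [[0, 2, 1], [3, 5, 0], [1, 1, 2], [4, 8, 1]]
    y_train = ["TABULAR", "GRAPH", "TABULAR", "GRAPH"]

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(clf.predict([[2, 4, 0]]))  # predicted label for an unseen query

Whether the actual model is a feed-forward network, an LSTM, or something else entirely is exactly what the paper needs to state.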

The experiments show that Optima is better than Squerall on a number of queries, which, as I understand it, are more efficiently evaluated over the GRAPH virtual data model; on the other queries, where the two systems show comparable performance, Optima and Squerall are essentially doing the same thing (i.e., evaluating over the TABULAR data model). The times reported in Table 2 seem to include the data wrapper time, but I did not understand over which data sources the queries were run. I also do not understand what the impact of data transformation on the overall query answering time is. I would expect that converting relational data into graph format, while saving on query processing time, should require some computational resources.

Another set of experiments compares the learned model with some other LSTM model, but it is not at all clear what that other model is.

I had a very quick look at the implementation, and it seems to be just a number of scripts that only work for the scenario described in the paper. For a systems paper I would expect to see fewer hard-coded examples and more proper abstractions.

As for related work, there has been work on a generalized OBDA framework that allows querying non-relational data sources using SPARQL.

All in all, I find this paper to have some very interesting ideas, but
in its current form I do not think it is ready to be published. To summarise:

-- everything is explained very superficially; it is not clear what the actual framework underpinning the system is, or what it is formally doing.

-- the experiments are not explained properly

-- the tool is not really a tool; it is rather a proof-of-concept implementation hard-coding the scenario considered in the paper.


Comments

This paper was submitted for consideration in the Special Issue "Tools & Systems", and the "Submission type" should be Tool/System Report.

Thanks, this has been corrected.

Pascal.