SPARQL query execution time prediction using Deep Learning

Tracking #: 3336-4550

Authors: 
Carlos Buil Aranda
Daniel Casals
Carlos Valle

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Abstract: 
Nowadays new graph-oriented database engines (GDBMS) are capable of managing large volumes of graph data. These engines, like any other database engine, have an internal query optimizer that proposes a query execution plan (QEP) for each query they process. This step is entirely necessary since without it queries over large amounts of data may need hours to terminate. Selecting a good enough query plan is a difficult task, as there may be thousands of possible query plans to consider for a single query. The development of a good query planner is considered the "Holy Grail" in database management systems. Recent results show how neural network models can help in the optimization process in Relational Database Management Systems (RDBMS); however, performance prediction in the area of graph databases is still in its infancy. There are two challenges in predicting graph query performance using machine learning models, yet to be addressed: the first is to find the most suitable query representation in a format that can be interpreted by Machine Learning (ML) algorithms; the second (which depends on the first) is the learning algorithm used to predict performance. Herein we propose a method that fills this gap, by providing first a query characterization that (second) allows us to develop a deep learning model to estimate query latency. We adapt an existing model using convolutions over trees to obtain a model that accepts SPARQL queries, and evaluate it against Wikidata, using its query logs and data. Our model correctly predicts the latency for 87% of the queries in the validation set, and we provide an explanation for those queries whose latency we fail to predict.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/May/2023
Suggestion:
Reject
Review Comment:

# Summary

This work studies how to predict the execution time of SPARQL queries using a specific application of deep learning.
In particular, it uses a feature representation based on the presence of algebra operators on the one hand, and on the structural similarity of the query to previous queries on the other.
The features are exploited in conjunction with a tree-convolution deep neural network.
This work is an extension of a previous work in which a simpler DNN architecture was proposed instead.
The present work tackles an important problem, but has a series of unstated assumptions, important methodological issues, and poor experimental evaluation.
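
For readers less familiar with the technique, a minimal sketch of tree convolution follows (my own illustration, not taken from the paper): a filter combines each plan node's feature vector with those of its two children, so the learned weights react to local sub-tree shapes. The node features, dimensions, and toy plan below are invented.

```python
# Minimal sketch of tree convolution over a binary query-plan tree.
# Everything here (features, sizes, the toy plan) is illustrative only.
import numpy as np

class Node:
    def __init__(self, features, left=None, right=None):
        self.features = np.asarray(features, dtype=float)
        self.left, self.right = left, right

def tree_conv(node, w_parent, w_left, w_right, bias):
    """Return a new tree whose node features are the convolved values."""
    if node is None:
        return None
    zero = np.zeros_like(node.features)
    left_feat = node.left.features if node.left else zero
    right_feat = node.right.features if node.right else zero
    out = np.maximum(0.0,  # ReLU
                     w_parent @ node.features
                     + w_left @ left_feat
                     + w_right @ right_feat
                     + bias)
    return Node(out,
                tree_conv(node.left, w_parent, w_left, w_right, bias),
                tree_conv(node.right, w_parent, w_left, w_right, bias))

# Toy plan: a join over two scans, 4 input features per node, 8 output channels.
rng = np.random.default_rng(0)
plan = Node([1, 0, 0.2, 0.5], Node([0, 1, 0.9, 0.1]), Node([0, 1, 0.4, 0.3]))
w_p, w_l, w_r = (rng.normal(size=(8, 4)) for _ in range(3))
convolved = tree_conv(plan, w_p, w_l, w_r, np.zeros(8))
print(convolved.features.shape)  # (8,) per node after one convolution layer
```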

# Strong points:

S1+ The task of predicting the execution time of a SPARQL query is an important challenge; it is also an area that is currently not well explored.

S2+ This work tries to exploit as much as possible the important lessons learned from the RDBMS literature.

S3+ Experiments adopt real queries, and the analysis also investigates the cases where the present solution fails to perform; this shows where the current research direction needs stronger re-orientation.

# Weak points:

W1- The work does not clearly state its end goal. While there is a vague hint that this work could be important for building intelligent query optimizers, the current solution is *by design* inappropriate in that regard.

W2- The work is based on the assumption that past queries are sufficient to describe future queries; thus the model needs to be retrained every time the graph or the query workload changes.

W3- The experimental evaluation fails to validate the appropriateness of the data selection and to systematically inspect the performance of the proposal across queries of different nature.

# Detailed Comments:

D1. This work fails to properly and clearly motivate and explain the problem it is addressing. It is missing a proper problem definition; this is also connected to the fact that the goal of the prediction is left implicit.

D2. Connected to the above, the abstract and introduction hint towards the area of learned query optimizers, which is an important area. Yet this work does not, at any time, propose a solution applicable in that regard. This work assumes as input the SPARQL query and, at best, *one version* of the entire logical plan. The system is designed such that two different logical plans will always collide and be represented in the same way. Further, it does not consider in any way any physical plan or operator. Importantly, at some point it is hinted that multiple nodes in the plan will be translated to SCAN operations, but in triplestores the majority of joins will be index nested-loop joins, so a number of leaves will actually not be executed as scans. See {A} for an in-depth discussion of physical query execution in triplestores and how different query operators are affected.

D3. In the introduction the paper first declares that there is no solution using DL for the task, and a few paragraphs later points out that this work is an extension of a previous work. The first statement is then blatantly false.

D4. Section 2.2 seems to only have the effect of clearly explaining why the current work is not appropriate for the task of query plan optimization. Also, I believe Fig. 1 is not relevant to the paper, and I doubt that [10] is the canonical reference for the architecture, which is a classical textbook DBMS architecture.

D5. In Sec. 3.3 it is mentioned that this work refers to [7] when it comes to selecting query centroids, which in [7] are based on query templates. The assumption that all queries derive from query templates is a strong and unrealistic assumption. It is unclear whether this work steers away from this methodological issue. Nonetheless, computing edit distances to cluster queries is an extremely expensive step, which is rarely affordable; this work does not comment on this issue, nor on the fact that it is important to select an appropriate method to decide how many clusters to extract. The same should be reflected in the experimental evaluation.
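
For illustration, even a naive version of this clustering step (my own sketch, not the authors' code) requires a quadratic number of edit-distance computations over the query log before k-medoids can start:

```python
# Pairwise edit distances between query strings followed by a naive k-medoids
# loop. The O(n^2) distance matrix is what makes this step expensive on large
# logs; empty-cluster handling is omitted in this toy version.
import numpy as np

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def k_medoids(dist: np.ndarray, k: int, iters: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(iters):
        labels = dist[:, medoids].argmin(axis=1)
        # move each medoid to the cluster member with the smallest total distance
        new = np.array([np.flatnonzero(labels == c)[
                            dist[np.ix_(labels == c, labels == c)].sum(axis=1).argmin()]
                        for c in range(k)])
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return medoids, dist[:, medoids].argmin(axis=1)

queries = ["SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }",
           "SELECT ?x WHERE { ?x wdt:P31 wd:Q146 }",
           "SELECT ?p ?o WHERE { wd:Q42 ?p ?o } LIMIT 10"]
n = len(queries)
dist = np.array([[edit_distance(queries[i], queries[j]) for j in range(n)]
                 for i in range(n)], dtype=float)   # O(n^2) distance calls
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```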

D6. In Sec. 4, for LIMIT it is stated that the numeric value is used as a feature. Is it scaled? How?

D7. In Sec. 2 the notion of a basic graph pattern is defined but the acronym BGP is not introduced; after that, the acronym is used without being explained. Further, later on a difference between a triple pattern join and a BGP join is mentioned, but the difference is not explained.

D8. It is not clear how exactly the final feature vectors are built. A more detailed and formal analysis of the entire end-to-end pipeline, from the query to the actual dimensions and number of feature vectors, should be presented. Also, the algorithm to build the binary tree is not described. How does this tree represent UNION, FILTER, and aggregate operators?
It is unclear what features are used as the "implementation plan level" features.
The paper states: "We enrich each node's information by adding node semantic information such as the estimated cardinality for the node according to the optimizer, cost information, etc". The "etc" should be elaborated, either in the experiments section or in the methodology section.
Further, the applied feature engineering approach is not elaborated on. The sentence "This information is provided by the query optimizer, and also introduces a high dimensionality in the node vectors due to the fact that the databases are composed of thousands of predicates" indicates a predicate-level encoding, but the "how" is not elaborated.
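
To make the request concrete, this is the kind of per-node encoding I would expect the description to pin down (an assumption on my part, not the paper's actual encoding): a one-hot over algebra operators, log-scaled optimizer estimates, and a one-hot over the predicate vocabulary, whose size is exactly what causes the dimensionality problem mentioned above.

```python
# Sketch of one plan-node feature vector: operator one-hot + log-scaled
# optimizer estimates + predicate one-hot. Vocabularies are invented;
# a real store has thousands of predicates.
import numpy as np
from typing import Optional

OPERATORS = ["BGP", "JOIN", "LEFTJOIN", "UNION", "FILTER", "PROJECT"]
PREDICATES = ["wdt:P31", "wdt:P279", "rdfs:label"]   # thousands in practice

def node_vector(operator: str, est_cardinality: float, est_cost: float,
                predicate: Optional[str]) -> np.ndarray:
    op = np.zeros(len(OPERATORS))
    op[OPERATORS.index(operator)] = 1.0
    pred = np.zeros(len(PREDICATES))
    if predicate is not None:
        pred[PREDICATES.index(predicate)] = 1.0
    stats = np.array([np.log1p(est_cardinality), np.log1p(est_cost)])
    return np.concatenate([op, stats, pred])

v = node_vector("BGP", est_cardinality=8_500_000, est_cost=9_000_000,
                predicate="wdt:P31")
print(v.shape)   # len(OPERATORS) + 2 + len(PREDICATES) dimensions per node
```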

D9. In Sec. 4.1 an important methodological issue arises: cardinality estimation is based on the engine, and it is unclear how the engine computes it (in Jena there is an additional functionality to turn on, but the details are not described here). Further, the cardinality estimation is based on predicates alone, and it is clear (as the experiments show) that this is a poor choice, since there are indeed large hubs and the current model does not take them into account.

D10. Sec. 4.1.2: "hundred of predicates may be present". Where? Are they in the query log?

D11. The dataset excludes queries using SERVICE; this is a poor choice. How many queries have been removed then? A good portion of SERVICE queries in Wikidata are for the label service; one can easily write a script that keeps only the 'good parts' of a query without throwing away so many of them. It is not specified in which period the log files were generated. The explanation of the train/validation/test split is unconventional: usually one expresses the full split as 80-10-10 or 60-20-20. Section A.1 does not describe the filtering process. Further, it is unclear if there is any information leakage (the same queries appearing across splits). Also, it is unclear whether the proportions of query operators and of query run-times are comparable across splits. Finally, queries that return an empty result set should not be discarded.
How and under which conditions the query latency is collected is unspecified, i.e., which triple store, hardware specifications, etc. are used.
In practice, the current query workload is completely unreliable.
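
One simple way to rule out the leakage concern (a sketch of what could be reported, not what the paper does) is to deduplicate on the normalized query string and split at the level of unique queries, so identical queries can never land in two different splits; the 80/10/10 proportions below are arbitrary:

```python
# Leakage-free train/validation/test split at the level of unique query strings.
import random

def split_without_leakage(queries, seed=42, train=0.8, valid=0.1):
    unique = sorted({q.strip() for q in queries})        # group duplicates
    random.Random(seed).shuffle(unique)
    n = len(unique)
    n_train, n_valid = int(n * train), int(n * valid)
    train_q = set(unique[:n_train])
    valid_q = set(unique[n_train:n_train + n_valid])
    buckets = {"train": [], "valid": [], "test": []}
    for q in queries:                                    # map back every log entry
        key = q.strip()
        split = "train" if key in train_q else "valid" if key in valid_q else "test"
        buckets[split].append(q)
    return buckets

log = ["SELECT * WHERE { ?s ?p ?o } LIMIT 5"] * 3 + ["SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }"]
parts = split_without_leakage(log)
print({k: len(v) for k, v in parts.items()})
```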

D12. The method's performance should be broken down by query runtime, e.g., relative error across fast, normal, and slow queries. The discussion of the ablation study is almost completely absent. The contribution of the different components, as well as of the different choices (e.g., normalization of features, number of clusters, etc.), should be discussed.
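
Concretely, the breakdown asked for here could look like the following sketch (bucket boundaries and the 0.2 threshold are my own choices, the latter mirroring the 20% margin used for the paper's headline result):

```python
# Relative error per runtime bucket plus the share of queries predicted
# within a 0.2 relative error. Cutoffs and toy numbers are illustrative.
import numpy as np

def report(y_true, y_pred, fast_cutoff=0.1, slow_cutoff=10.0, threshold=0.2):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rel_err = np.abs(y_pred - y_true) / np.maximum(y_true, 1e-9)
    buckets = {"fast": y_true < fast_cutoff,
               "normal": (y_true >= fast_cutoff) & (y_true < slow_cutoff),
               "slow": y_true >= slow_cutoff}
    for name, mask in buckets.items():
        if mask.any():
            print(f"{name:6s} n={mask.sum():3d} "
                  f"median rel.err={np.median(rel_err[mask]):.2f} "
                  f"within {threshold:.0%}: {(rel_err[mask] <= threshold).mean():.0%}")

# Latencies in seconds: true vs. predicted (toy numbers).
report([0.05, 0.4, 2.0, 30.0], [0.06, 0.7, 1.9, 12.0])
```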

D13. The work should also be tested on a second dataset and query workload, e.g., something from LSQ {B}.

D14. The results against the baseline model, i.e., the SVM-based approach, are not presented in the core of the paper. It is recommended to emphasize this comparison and to add the later work (Learning-based SPARQL query performance modeling and prediction) as another baseline.

D15. For the metric in Figure 7, the choice of thresholds is not explained, e.g., why 0.2 is the threshold for a "good" prediction.

D16. As the Introduction motivates this problem as a component for query planning, the inference time and the size of the model would also be interesting to look into.

D17. The work Multi-metric Graph Query Performance Prediction {C} is not mentioned. As the scope of this work is graph query performance prediction, there needs to be a proper justification of how the proposed contributions differ from that work; otherwise, that work should also serve as a baseline in the evaluation. Moreover, it was published later than the work mentioned in the Introduction as being the latest work on graph query performance prediction.

## Presentation issues.
- Do not use references as part of sentences, they should be used as annotation.
- sec3.1 "in despite"
- sec 3.2 Nu-SVR and KCCA are presented without reference
- bgp in lower case instead of uppercase in multiple instances
- Kgp sometimes K_gp
- Sec 4.1.2 Fig??
- Fig. 6 axis labels are missing
- Fig. 7 axis labels are not in english
- The reference section contains numerous formatting issues

There are sentences in the work that are vague, use incorrect terminology, or are not specific enough to grasp the correct meaning behind them.
- in the introduction, “For query prediction in graph databases, the latest results ...”. Query prediction and query performance prediction are different challenges.
- in the Introduction “Finally, the gap in performance prediction hinders progress in optimizing SPARQL queries and other applications mentioned above”.
- the triple pattern definition for SPARQL uses blank nodes in place of variables, which is not the standard and correct formal definition of triple patterns

- In the query-level features, it is unclear whether it is the number of triple patterns, the number of basic graph patterns, or both that are used: "We also include as query features the number of triple patterns (bgps) in the queries and the depth of the tree".

- There are arxiv references instead of the reference to the canonical peer-reviewed publication.
- E.g., “Learned Cardinalities: Estimating Correlated Joins with Deep Learning” is published in CIDR.

- In Section 4.2, it is mentioned that this work will not learn partial query plans "since that research is bounded to latency prediction". It is unclear how the latency prediction task differs from the performance prediction task in this work.

- The readme attached to the code in the GitHub repo is missing clear installation steps.

{A} Sagi, Tomer, et al. "A design space for RDF data representations." The VLDB Journal 31.2 (2022): 347-373.

{B} Stadler, Claus, et al. "LSQ 2.0: A linked dataset of SPARQL query logs." Semantic Web Preprint (2022): 1-23.

{C} Sasani, Keyvan, et al. "Multi-metric graph query performance prediction." Database Systems for Advanced Applications: 23rd International Conference, DASFAA 2018, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, Part I 23. Springer International Publishing, 2018.

Review #2
Anonymous submitted on 16/Jun/2023
Suggestion:
Major Revision
Review Comment:

1. There is substantial repetition between the abstract and the introduction, which prevents the paper from having an engaging introduction (Page 1, line 40 until the end of Page 1).

2. A LaTeX error is observed on Page 2, line 30.

3. Although the presented work is an extension, the contribution of the authors compared to the previous research is not adequately explained. It would be beneficial to briefly describe the previous architecture or provide a quantifiable measure demonstrating the improvement achieved by the proposed approach.

3. Please review Page 3, line 11, as Figure 2.2 seems to be referenced incorrectly. Presumably, Figure 2.b is intended.

4. On Page 3, line 14, the term "internal representations" lacks clarification. It would be helpful to provide an explanation of what these representations entail.

5. For Figure 2.b, there is a reference to a numbering system, but the graph does not display any numbers.

6. The motivation for this paper appears to be the interest of the RDBMS community in applying machine learning within DBMSs. It would be beneficial to provide an example in the introduction, illustrating a query whose execution time is challenging to predict accurately using traditional methods, while the proposed approach may offer improved performance.

7. On Page 3, lines 19-20, the meaning of "the nodes to send the query" is unclear. Please provide further clarification.

8. The example provided on Page 3, lines 26-36, requires additional elaboration. The focus is primarily on birthplace and birth date, along with triple patterns, rdf:type, filter operators, and the optional operator. Please provide a more detailed explanation of this example.

9. There seems to be a discrepancy between the statement on Page 3, line 3, which mentions a focus on SELECT join, and the example, which contains the aforementioned operators. This inconsistency is confusing.

10. On Page 3, line 35, it would be helpful to provide the full sentence or context regarding the mentioned SPARQL query.

11. Please explain the rationale behind choosing K-medoids instead of other clustering algorithms. Clarify this point on Page 7.

12. On Page 7, line 2, there is a capitalization error in the phrase "In the next Section."

13. The reference to Figure ?? on Page 7, line 21, requires correction to accurately refer to the figure in question.

14. Please explain the meaning of VaR_UrI_var or Var_UR_URI. Additionally, ensure that any abbreviations are properly introduced and defined for readers who may not be familiar with them.

15. Provide a clear definition of "bounded variable" as mentioned on Page 7, lines 45-47.

16. The sentence "The query engine by first select" on Page 7, line 43, needs revision for clarity and accuracy.

17. On Page 7, line 49, it is unclear which predicates have lower selectivity. Consider revising the statement for clarity. It may be necessary to clarify if it refers to high or low selectivity.

18. Please provide a detailed explanation of how node semantic information is computed, as mentioned on Page 7, line 30.

19. Prior to utilizing $P^{(i)}$ and $P^{(f)}$, it would be helpful to define these variables in the preliminaries section, as mentioned on Page 8, line 2.

20. The meaning of numbers from c1 to c4 in Figure 3 requires explanation.

21. Figure 4 appears to be similar to Figure 5 in Marcus's paper, with slight differences in colors and additional detailed shapes. Provide a clearer figure that emphasizes the distinctions from the original paper (Neo: A Learned Query Optimizer).

22. While it is apparent that the authors aim to introduce learned query optimizers to the graph database realm, the paper lacks significant contributions. Can you add a discussion section that explains and addresses the challenges faced in applying learned optimizers to SPARQL compared to structured databases?

23. Section 4.2.1 resembles the corresponding section in the original paper (Neo); it was not easy for me to comprehend, and I had to go back to the Neo paper to understand the proposed architecture. This section needs to be more concise and clear.

24. In Section 4.2.3, the authors mention the use of autoencoders for dimensionality reduction, which represents a novel aspect compared to the Neo paper. However, the authors do not explain their design choices in this regard. Additionally, it would be beneficial to cite papers that have employed autoencoders for dimensionality reduction.
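
For context, the kind of minimal setup I assume is meant here (my sketch, not the authors' code) is a one-hidden-layer autoencoder that compresses sparse predicate one-hot vectors into a small dense code used as node features; the vocabulary size and code size below are invented:

```python
# Toy autoencoder for reducing the dimensionality of predicate one-hot vectors.
import torch
from torch import nn

n_predicates, code_size = 5000, 32

autoencoder = nn.Sequential(
    nn.Linear(n_predicates, code_size),   # encoder
    nn.ReLU(),
    nn.Linear(code_size, n_predicates),   # decoder
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.zeros(64, n_predicates)                                 # toy batch of
x[torch.arange(64), torch.randint(0, n_predicates, (64,))] = 1.0  # one-hot rows

for _ in range(5):                                   # a few training steps
    reconstruction = autoencoder(x)
    loss = loss_fn(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

encoder = autoencoder[:2]                            # keep only the encoder
codes = encoder(x)                                   # 64 x 32 dense features
print(codes.shape)
```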

25. Provide an explanation of the connection and order between Figure 4 and Figure 5.

26. The architecture appears to be the same as Neo, with the addition of an autoencoder to capture essential features of the numerous predicates in graph data. Is this understanding correct?

27. On Page 10, line 47, the mention of Bonifati et al.'s analysis of query logs from Wikidata and the prevalence of relatively simple queries consisting mostly of single triple patterns requires clarification. How did this reality impact the training process and the results of the proposed model? Have you tried different logs, maybe synthetic ones?

28. Page 10, line 47, lacks clarity regarding the number and nature of the queries involved in training the model. Provide statistical information about the query features and their properties for a better understanding. How exactly was the selection made?

29. The GitHub repository and the Huggingface page provided in the paper do not contain the queries used for training. It is recommended to make the queries available and easily accessible.

30. The paper lacks an analysis of how the learned query optimizer compares to traditional query optimizers in terms of performance. A comprehensive discussion and comparison with traditional query optimizers are necessary.

31. Additional experiments are required to demonstrate the quality of the optimizer with queries of varying shapes and features.

Reference:
1) Angela Bonifati, Wim Martens, and Thomas Timm. 2017. An analytical study of large SPARQL query logs. Proceedings of the VLDB Endowment 11, 2 (2017), 149–161.

Review Summary:

---

Based on the comments provided, it is clear that the paper requires a major revision before it can be considered for acceptance. We have raised several important issues regarding clarity, organization, justification of contributions, and missing information. In order to address these concerns, the authors should consider the following:

1) Improve Abstract and Introduction: Remove repetition, ensure an engaging introduction, and provide a clear example illustrating the challenges of traditional query prediction.

2) Clarify and Elaborate: Address unclear terminology, explain the meaning of numbers in figures, provide additional clarification for examples and sentences, and define abbreviations.

3) Correct Errors and References: Fix LaTeX errors, correct figure references, and add missing numbering to Figure 2.b.

4) Explain Contribution and Rationale: Clearly explain the authors' contribution compared to previous research, quantify the improvement achieved, and provide the rationale for choosing specific algorithms.

5) Provide Additional Information: Elaborate on computation processes, clarify unclear statements, define variables, and explain the impact of the query logs on training.

6) Conduct Comprehensive Analysis: Include a performance comparison with traditional query optimizers, conduct additional experiments with varying query features, and add a discussion section addressing the challenges of applying learned optimizers to SPARQL.

7) Address Similarity to Neo Paper: The paper has a high similarity to the Neo paper, including the approach, architecture, and methodology, with slight modifications to apply the approach to SPARQL rather than SQL. The authors should clearly highlight the distinctions and unique aspects of their work compared to Neo. It is important to avoid appearing as a mere replication of Neo and to emphasize the novel contributions and advancements made in this paper. By addressing this concern, the paper can establish its own identity and demonstrate its originality and value beyond being perceived as a replication of the Neo paper. To summarize, in many parts the paper seems incremental on top of the original Neo paper (see comments 21, 23, and 26). The authors should justify more convincingly how and why their work goes beyond a straightforward application of the Neo approach and is worth a journal publication, and what different/additional challenges appear in the RDF context that do not apply to the original RDBMS setting.

Review #3
Anonymous submitted on 10/Jul/2023
Suggestion:
Major Revision
Review Comment:

Summary

In this paper, the authors develop a new deep learning architecture and query representation to estimate SPARQL query performance in terms of query latency.
Their query model is composed of two elements, an algebraic part and a graph pattern part. The first considers the algebra tree of the SPARQL query, while the second considers the graph patterns present in the query. The latter part is created by assigning to each graph pattern a value related to its similarity to the medoids computed with the k-medoids clustering algorithm.
The architecture is an extension/modification of the one proposed in [13]. The one proposed by the authors aims at estimating the query latency from the complete query plan and is based on CNNs. In addition, it makes use of autoencoders to reduce the dimensionality of the data.
They evaluate their approach using the query logs from Wikidata and compare it with the NuSVR approach of reference [7].
Their approach can correctly predict (within a 20% margin of error) 87% of the queries in the dataset. Also interesting is their discussion of why their approach failed to predict the latency of some queries.

Originality

The work looks original. While using machine learning to support query plan optimization has been investigated in relational databases, the problem in the graph context is still under-researched.

Significance of the results

The results look significant; I did not find any major flaws in the design of the approach. The architecture the authors use is an extension and modification of an architecture from the state of the art.

However, the major problem is the lack of comparison with non-machine-learning methods. While their approach outperforms the one presented in [7], there is no indication of how it behaves against classical methods. This is important because, while ML models may have better performance, they usually come with additional costs related to their training and deployment. Also, the authors do not mention the characteristics of the machine used for the experiments, nor do they report the performance of their method in terms of CPU, memory usage, and time.

Here are some other comments related to the clarity of the article (that are not related to the writing itself)

“Experts studying query execution plans learn to recognize sub-optimal or good plans by tree pattern matching.” -> I think it needs a reference.

It is not very clear how the graph pattern part of the query representation is created. The authors mention that - after creating the graph by merging the graph patterns - the next step is "to make clusters from the queries using the k-medoids algorithm". Which queries? Ideally, all the ones received before? Or the queries belonging to a training set?

It is not clear how the implementation-level information is utilized (the reference to the figure is broken, so I guess the figure is missing?).

Why is the comparison with the other method in the appendix? It should be a core part of the paper. If the length of the paper is a concern, I suggest the authors move the details of the hyperparameter tuning to the appendix instead. The contribution would be much more solid if the comparison with the baseline were done both in the results section (i.e., by comparing the performances using statistical tests) and in the discussion (i.e., going into the details, as the authors do right now for their method, also for the baseline). The same comment, I think, holds for the analysis of the autoencoders, since it is a core part of the architecture.
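
To make the "statistical tests" suggestion concrete, one common option (my example, not something the paper already does) is a paired Wilcoxon signed-rank test on the per-query errors of the two models, since both are evaluated on the same queries:

```python
# Paired, non-parametric comparison of per-query absolute errors (toy numbers).
import numpy as np
from scipy.stats import wilcoxon

y_true  = np.array([0.1, 0.5, 2.0, 8.0, 30.0])    # toy latencies in seconds
model_a = np.array([0.12, 0.55, 1.8, 9.0, 24.0])  # proposed model
model_b = np.array([0.3, 0.9, 3.5, 5.0, 12.0])    # baseline (e.g. NuSVR)

err_a = np.abs(model_a - y_true)
err_b = np.abs(model_b - y_true)
stat, p_value = wilcoxon(err_a, err_b)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")
```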

The discussion on why the system fails at predicting the latency of some queries is interesting, but it would benefit from a more detailed explanation. For instance, I guess that an unsupported operator "breaks" the prediction because it does not get encoded in the algebraic part of the query, but this is not clearly stated in the text. Similarly for the other case. I suggest the authors rephrase the "Single triple pattern queries" paragraph to highlight better that the problem is that their approach considers as very similar two graph patterns that in reality match a very different number of triples (and hence the very different latency estimations).

Quality of the writing

The article is generally well written, here I list some comments.

While the part that starts with "This work is the extension of the research presented in [4]..." is useful at review time, I suggest the authors either remove it or rephrase it and move it to the related work section.

Page 7, lines 21 - 22: broken figure reference

“k-medioids algorithm” -> “k-medoids algorithm”

Some references in the bibliography are broken (for example, reference [15]).

In Section 5: "In Section A.1 we describe the process we followed to (ii), extracting the query characteristics (i.e. extracting the features vector)." Section A.1 refers to the appendix, and in particular to a section explaining hypothesis testing; it says nothing about extracting query characteristics.

In the "Missing query operators" paragraph, the sentence "[..] That makes 738 queries not supported by our model, 53% of the queries in the dataset" is misleading: 738 is 53% of the badly predicted queries that remain after the removal of duplicates, not of the whole dataset. Also, the title of the paragraph itself is ambiguous: the problem is not that the operators are missing but that they are not supported.

Quality of the resources

While all the code and data are available on the authors' GitHub, I was not able to run the ModelTreeConvSparql.ipynb notebook, which apparently is the main entry point, as it should both train and evaluate the models.

The readme could do a better job of explaining how to install everything needed to run the experiments and of guiding users toward repeating the evaluation shown in the paper.