Abstract:
Modern graph database management systems (GDBMS) are capable of managing large volumes of graph data. Like any other database engine, they include an internal query optimizer that proposes a query execution plan (QEP) for each query they process. This step is essential: without it, queries over large amounts of data may take hours to terminate. Selecting a good enough query plan is difficult, as there may be thousands of candidate plans for a single query. Developing a good query planner is considered the ``Holy Grail'' of database management systems.
Recent results show how neural network models can help in the optimization process in Relational Database Management Systems (RDBMS); however, performance prediction for graph databases is still in its infancy. Two challenges in predicting graph query performance with machine learning models remain to be addressed: the first is finding a suitable query representation in a format that Machine Learning (ML) algorithms can interpret; the second, which depends on the first, is the learning algorithm used to predict performance. Herein we propose a method that fills this gap: we first provide a query characterization, which then allows us to develop a deep learning model that estimates query latency. We adapt an existing model based on convolutions over trees to handle SPARQL queries and evaluate it against Wikidata, using its query logs and data. Our model correctly predicts the latency of 87% of the queries in the validation set, and we provide an explanation for the queries whose latency we fail to predict.