Generation of SPARQL Query Extension Indices Based on Query Logs and Data

Tracking #: 3317-4531

Vidar Klungre

Responsible editor: 
Guest Editors Interactive SW 2022

Submission type: 
Full Paper
In the context of interactive query construction systems used to iteratively build SPARQL queries from scratch, we consider the problem of detecting query extensions that lead to queries without results, called unproductive extensions. To do this, the system simply needs to check if the resulting query returns no answers, but this may be an inefficient process, especially if the query is complex because it then requires the calculation of multiple expensive joins. This problem can be solved by setting up an index structure to efficiently answer the relevant queries, but since there is no bound on the size and complexity of these queries, it is impossible to make such an index with finite space. This can be solved by building an index that only covers the most common query patterns. This leads to a smaller, finite index, but may cause some inaccuracy in the detection of unproductive extensions. In this paper, we explore the typical trade-off between precision (ratio of detected dead-ends) and cost (required storage space) of such indices, and we present techniques that produce indices with low cost without sacrificing too much precision, given a dataset and a representative query log. This is a challenging problem: for all realistic datasets and query logs, the search space of possible indices is too large to cover completely. Additionally, calculating the cost and precision of a single index is in itself computationally expensive. We propose a solution to this problem where efficient cost and precision estimates are used to guide the search for optimal indices. Our evaluation, which uses an extensive benchmark based on Wikidata, shows that our technique is able to efficiently compute non-trivial configurations with both high precision and low cost.
Solicited Reviews:
Review #1
By Zhangcheng Qiang, submitted on 28/Dec/2022
Review Comment:

This paper explores four different search heuristics (Random, Greedy Query Weight, Greedy Precision, and Exploratory) for generating SPARQL query extension indices based on query logs and data. The trade-off between precision (ratio of detected dead-ends) and cost (required storage space) is examined. The paper is well-designed and well-written. The author clearly explains the problem, designs and implements the experiments, and justifies the results. My major concern is the selection of the two different query logs in the evaluation section. One of the query logs the author chose is "Queries with 2 or more variables and weight >= 10.0". As far as I know, indexing works more efficiently for complex queries but is less powerful for simple queries. I wonder why the author chose 2 as the threshold for evaluating the indexing. A potential solution would be to add a no-indexing baseline to the evaluation and find the threshold at which indexing starts to perform better. The paper would be more convincing if the author could provide more details on the query-log selection and add no-indexing as a reference point in the experiments.

Review #2
Anonymous submitted on 12/Jan/2023
Minor Revision
Review Comment:

This paper focuses on improving the efficiency of ontology-based interactive SPARQL query construction systems. It proposes four search methods for building indices that aim to provide precise recommendations at a low cost, and two estimation methods for calculating the cost and precision. These methods are evaluated using a dump of Wikidata.

In general, the paper tackles an interesting issue and is well presented and structured. All resources are available and accessible online. Section 2 describes clearly and in detail the basis upon which the rest of the paper builds. The quality of the writing is fine, but feels rather colloquial at times.

My main concern revolves around the evaluation, how it is presented and the impact of the contributions:
* What is the importance of evaluating with two query logs instead of aggregating everything into one? There is no discussion of the differential results between the two.
* The “weight” of the queries is mentioned several times, and becomes an important metric as the evaluation advances; however, I find no definition of it in the paper, nor of how it is extracted/calculated.
* Why use the 2015 version of Wikidata? There are several more recent dumps.
* Is there a particular reason why LB (Sec. 6.4) is run before LA (Sec. 6.5)? On page 22 they are presented in the inverse order of the sections; I suggest following the same order in both parts.
* I find Figures 10-13 confusing: the configurations (represented as diamonds) don’t seem to appear much, and it is not explained in the text why they are in the same diagram as the methods.

In more general terms regarding impact, I miss a couple of things: a comparison with existing methods, to check whether the proposed ones really constitute an improvement; and an implementation in a real scenario, to see how the methods perform in real-time query construction, which is the end goal of the contributions. Are these issues considered part of the future work? Moreover, the paper could benefit from an additional section discussing the results’ impact, or from extending the last section with this issue.

Minor issues and typos:
* In Section 5.2, examples are presented for a couple of configurations: it would be nice to extend the examples to all configurations
* Page 17, paragraph 4: Swap “Figure 3” and “Figure 8”?
* Page 26, line 2: immedeately -> immediately

Review #3
Anonymous submitted on 07/Mar/2023
Review Comment:

The paper proposes a solution to balance the precision and cost of producing indices for preventing the unproductive-extensions problem (queries that have no match). This problem can be alleviated by using a pre-computed index, but the cost of the index (especially its size for large domains) and its precision should be taken into account. This work proposes heuristic-based search methods to find high-quality indices.

- Although the paper proposes a solution to the unproductive-extensions problem, the solution itself is not revealed in the abstract. Please clearly explain the contribution of the proposed work in the abstract.
- Similarly, the author mentions heuristic-based search methods for creating high-quality indices, but they are not explained in the introduction; the details can only be understood in later sections of the manuscript. It is understood that the author uses four heuristics, such as greedy query weight, the random method, etc. The contribution of the paper should therefore be clearly explained in both the abstract and the introduction.
- Furthermore, the work is an extension of references 4, 5, 6, and 7. The differences between the current submission and these works should be clearly specified.
- In addition, the structure of the paper can be improved; the related-work section could come right after the introduction, whereas currently it sits in the middle of the paper, before the definitions (Section 4). The differences between the proposed work and the related work should also be discussed.
- The problem-description section is very short, and it could be integrated into the introduction, which would also emphasize the importance of the work.
- The search-methods section, which contains the main contribution of the paper, the optimization of indices, could also be expanded. In particular, how are these methods applied to query-log estimation for SPARQL queries? Examples could be given to demonstrate this.
- In the evaluations, it is difficult to follow the discussions. It would be nice to provide an acronym table for the methods.
- Did you compare the proposed work with any relevant related work? It would be good to demonstrate some quantitative evaluations.

Overall, the contribution of the paper is not explained clearly, and the general structure of the paper is difficult to follow. The authors need to revise the paper.

Review #4
Anonymous submitted on 18/Apr/2023
Review Comment:

Originality and significance of the results
In the context of interactive SPARQL query construction systems, one may want to avoid suggesting query extensions that lead to queries without results. (The author claims that) the only way to achieve this efficiently is by means of precomputed indexes. The paper focuses on search techniques for finding good indices. Additionally, it provides an efficient method to compute cost and precision estimates, which are used to guide the search.

I have some concerns about the originality of the methods for addressing the problem as well as their significance.

The main contribution is four search methods to find good indexes. All methods start with an empty query and introduce one extension in each iteration until a maximum cost or another stopping criterion is reached. The first method introduces the most frequent extension in the query log. The second introduces a random extension each time (this method is considered “for reference”). The third method introduces the extension with the lowest cost at each iteration. The fourth method considers solutions that are not the best direct neighbour: instead, it picks ten random configurations two steps away and chooses the best of this set.
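For concreteness, the shared skeleton of these methods, as I understand it, is a simple hill-climbing loop. A minimal sketch in Python (the `cost` and `precision` functions and the set-of-extensions representation are my assumptions, standing in for the paper's actual estimate functions and data structures):

```python
def greedy_search(extensions, cost, precision, max_cost):
    """Greedy skeleton of the search: start from the empty
    configuration and, at each iteration, add the one extension
    that maximises the (estimated) precision, as long as the cost
    budget allows. `cost` and `precision` are hypothetical hooks,
    not the paper's actual interfaces."""
    config = set()  # all methods start from the empty configuration
    while True:
        # extensions that still fit within the cost budget
        candidates = [e for e in extensions - config
                      if cost(config | {e}) <= max_cost]
        if not candidates:
            return config  # budget exhausted: stop
        # locally optimal choice (the source of the local-optima concern)
        best = max(candidates, key=lambda e: precision(config | {e}))
        config = config | {best}
```

The three non-random methods differ only in the `key` used to pick the next extension (query-log frequency, precision gain, or cost), which is precisely why they all share the weakness of making systematically local decisions.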

The problem of searching in very large search spaces is well known in computer science, and it seems to be the main issue addressed in this paper. The proposed methods seem relatively naive, and there is not much discussion arguing why they should be interesting (especially in comparison with well-known metaheuristic search techniques). The fourth method is the only one that attempts to overcome the problem of systematically making locally optimal decisions, which is a major concern in this setting.

I would have expected a deeper analysis and a literature review on search methods that are suitable for the problem addressed since this is the main contribution, but the literature review wasn’t very informative in this area. For instance, it seems natural to me that ideas from evolutionary algorithms or even simulated annealing would be promising in this setting, and if they are not it would be interesting to get an intuition why.
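To illustrate how little machinery such an alternative requires, here is a minimal simulated-annealing sketch (the `neighbour` and `score` hooks are hypothetical stand-ins for configuration mutation and the precision/cost objective, not anything from the paper):

```python
import math
import random

def simulated_annealing(start, neighbour, score, steps=1000, t0=1.0):
    """Generic simulated annealing: accept worse neighbours with a
    probability that decays as the temperature cools, which lets the
    search escape the local optima that a purely greedy strategy
    gets stuck in."""
    current, best = start, start
    for step in range(1, steps + 1):
        t = t0 / step  # simple cooling schedule (an assumption)
        cand = neighbour(current)
        delta = score(cand) - score(current)
        # always accept improvements; accept worsenings with prob. e^(delta/t)
        if delta >= 0 or random.random() < math.exp(delta / t):
            current = cand
        if score(current) > score(best):
            best = current
    return best
```

Whether this would beat the paper's exploratory method on real index configurations is exactly the kind of question the literature review could have addressed.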

Another concern is that somewhere in the evaluation it is said that the main interest is in large queries, which justifies generating a separate query log. But why would the interest be in the seemingly infrequent large queries, rather than in the frequent short ones? What is the benefit of a system that does well on rare cases but whose precision isn’t tested against the usual ones? The fact that this focus wasn’t mentioned before, and the lack of explanation, could give the impression that this was a post-hoc decision, motivated by the fact that finding good solutions for short queries seems to be easy and does not discriminate elaborate search methods from simpler ones.

The provided precision measure is calculated over the query log. Also, the search space is in principle all possible queries, but it is then reduced to “what exists in the query log”. Since the motivation is to produce useful indexes for new queries, at least a brief consideration of the implications of these choices would be desirable. Would it make sense to split the log, so as to test the performance on actually new queries, which is what the use case would involve?

Again on the precision: there is a quick remark in a section called “related implementations” mentioning an evaluation of the precision and cost estimations. This was something that felt missing in the paper and, since the analysis exists, it should have been discussed in detail, especially since “the estimated cost is typically off by a factor of 2 compared to the true cost”. Also, it is unclear to me whether the evaluation is against the actual precision and cost, or against the estimation functions that are also used during the search.

Quality of the writing
I believe that the quality of the writing should be significantly improved to fit the standards of a well-ranked journal publication. I would encourage the author to ask collaborators for friendly reviews before submitting the manuscript, especially if English is not their first language.

Writing issues appear throughout the paper and include:
Grammar and spelling mistakes:
* “Never will” instead of “will never” on pg. 2.
* The subject of the sentence is missing in “execute the resulting query and check if it returns nothing”.
* “... the constructed query is large and requireS many joints to be answered”
* “I.e.” (e.g., on pg. 3) should not be at the beginning of a sentence.
* “... a (un)productive query”, then “... an (un)productive query” (pg. 4).
* Several nested parentheses (pg. 5).
* “fast enough NOT TO delay” (pg. 6).
* “There are several ways to construct an index …, but our approach is based on”: BUT is used to introduce a phrase or clause contrasting with what has already been mentioned, so it is misused in this context.
* “... can not” should be “cannot” (pg. 8).
* “The root column will in _ always only contain _” is incorrect; it can probably be replaced by something like “In table _, the cells of the root column can only contain _” (pg. 8).
* “... but also”: the sentence should start with “not only” (pg. 9).

Errors in the examples:
* “Example: Productive Extensions”:
  - In Fig. 1, borders and knows should be two-directional arrows (def. of N).
  - d_1 is of type Integer.
  - In Fig. 4, the caption refers to nonexistent queries.
* “Example: Index Construction”:
  - What is ansOpt?
  - A row for Alice 16 is missing, if I am correct.
  - In Fig. 7, I believe that an edge “knows” to a vertex “Person” is missing in Z_person, and that an edge “borders” to a vertex “Country” is missing in Z_country.
* (I stopped verifying the contents of the figures in detail here.)
In addition to actual errors, most graphics from Fig. 9 to Fig. 13 don’t show some of the points, either because of the representation or because they are out of range. Within reason, it would be good to find a way to display everything.
Ill-defined technical vocabulary usage
I would strongly encourage the authors to have a more formal presentation of section 2, including the use of numbered definitions and an explicit characterisation of their framework. The fact that things are introduced loosely and within paragraphs makes it time-consuming for the reader to figure out exactly what things are, and one has to reread full pages to try to find the definition of a reappearing symbol. There are small errors here and there, but these are hard to trace because of the lack of formal definitions. For instance, on page 3 it seems that the typing function T_q is intended to map vertices to types, yet it is then used also to map an edge to a type. In “Q_s is the largest subtree ..” (pg. 7): can’t there be several largest subtrees? Also, it seems implicit somewhere else that subtrees must preserve the same root. Since nothing is defined, it is hard to know if this is an error or if it was never made explicit. Overall, I believe that more explicit formal definitions would improve clarity and possibly save space.
Another example is in section 5 (pg. 14) with "W_m = {Q_t | t \in N}". N is not a set, it is a graph, so what is “t \in N”? Abusing notation one could accept t to be a vertex, but those can be classes and datatypes, not just classes. And what is Q_t? This is never explained. One gets an intuition from the example but nothing else. Then we have “W_r is the set containing one small configuration query for each property” (in the figure). From the examples, I deduce small is “of size 1”, but this should be specified. Later in the textual definition of “W_r” it mentions “...contains the values from Gamma_v that exists as targets of p”, what is p?
I would suggest being careful with the usage of certain vocabulary. Some examples:
I would avoid the abuse of superlatives and loose notions that are vague in their interpretation, e.g. “... requires deep knowledge about the ontology”, “The index grows very fast”, “it takes a lot of time” (Introduction, page 2). What is “deep knowledge” of an ontology? How fast is “very fast”? Exponential?
“the only way to guarantee sufficient efficiency is to build and use an index structure that is specialized for the task” (pg. 7). This sounds like a really strong claim to me. First, because what counts as “sufficient efficiency” is vague, and second because other researchers may come up with alternative strategies other than “building a specialized index for the task”. So I would refrain from this sort of assertion.
The problem description is said to be “finding THE optimal configuration”, but “optimality” is not defined in this context; it is only qualified as “high precision and low cost” (not even “highest/lowest”). Then it says: “there will for realistic setups never be only one single such optimal configuration”. Overall this is vague, which I would avoid, at least when describing the main aim of the paper. Alternatively, the author could simply define the problem as “finding Pareto-optimal configurations” from the beginning and briefly explain what that means.

Query Extensions and Productive Extensions, page 4: while it is easy to check that a filterless extension conforms to N, it may still be unproductive, so it should still not be suggested as a productive extension. However, in this paragraph, this is said to be trivial. Could you explain?
“The root column will in ans_e(Z,D) always only contain X”: What if there are no instances of the class at the root?
Isn’t Gamma_v unnecessary in equation (2.7)?

Long-term stable URL for resources:
In light of the concerns expressed in this review and my recommendation to reject the paper, I have not further analysed the resources provided by the authors.