Review Comment:
Originality and significance of the results
-------------------------------------------
In the context of interactive SPARQL query construction systems, one may want to avoid suggesting query extensions that lead to queries without results. (The author claims that) the only way to achieve this efficiently is by means of precomputed indexes. The paper focuses on search techniques for finding good indexes. Additionally, it provides an efficient method for computing cost and precision estimates, which are used to guide the search.
I have some concerns about the originality of the methods for addressing the problem as well as their significance.
The main contribution is four search methods to find good indexes. All methods start with an empty query and introduce one extension per iteration until a maximum cost or another stopping criterion is reached. The first method introduces the most frequent extension in the query log. The second introduces a random extension each time (this method is considered “for reference”). The third introduces the extension with the lowest cost at each iteration. The fourth method considers solutions that are not the best direct neighbour: instead, it samples ten random configurations two steps away and picks the best of this set.
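To make the comparison concrete, all four methods appear to share a loop of roughly the following shape (my Python reconstruction, not the authors' code; `empty_query`, `extensions_of`, `cost`, and `max_cost` are hypothetical placeholders):

```python
def build_index_query(empty_query, extensions_of, cost, max_cost):
    """Skeleton shared by the four search methods (a sketch).

    extensions_of(q) enumerates the candidate one-step extensions
    of q; cost(q) is the paper's estimated cost of q.
    """
    query = empty_query
    while True:
        candidates = extensions_of(query)
        if not candidates:
            break
        # The methods differ only in the selection rule:
        #   method 1: the extension most frequent in the query log
        #   method 2: a random extension ("for reference")
        #   method 3: the extension with the lowest estimated cost
        chosen = min(candidates, key=cost)  # method 3 shown here
        if cost(chosen) > max_cost:
            break  # stopping criterion: maximum cost exceeded
        query = chosen
    return query
```

The fourth method replaces the selection rule with a two-step lookahead over ten random candidates, which is the only deviation from a purely greedy choice.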
The problem of searching in very large search spaces is well-known in computer science, and it seems to be the main issue addressed in this paper. The proposed methods seem relatively naive, and there is little discussion of why they should be interesting (especially in comparison with well-known metaheuristic search techniques). The fourth method is the only one that attempts to overcome the problem of systematically making locally optimal decisions, which is a major concern in this setting.
Since this is the main contribution, I would have expected a deeper analysis of, and a literature review on, search methods suitable for the problem addressed; the provided literature review was not very informative in this area. For instance, it seems natural to me that ideas from evolutionary algorithms or even simulated annealing would be promising in this setting, and if they are not, it would be interesting to get an intuition why.
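For instance, a textbook simulated-annealing loop over configurations would be straightforward to try; a minimal sketch, assuming a `neighbour` function returning a random one-step modification and a `quality` function combining estimated precision and cost (all names are mine, not the paper's):

```python
import math
import random

def simulated_annealing(initial, neighbour, quality,
                        t_start=1.0, t_end=0.01, alpha=0.95):
    """Minimal simulated annealing over index configurations.

    Accepts worsening moves with a probability that shrinks as the
    temperature cools, which mitigates getting stuck in local optima.
    """
    current, best = initial, initial
    temperature = t_start
    while temperature > t_end:
        candidate = neighbour(current)
        delta = quality(candidate) - quality(current)
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            current = candidate
        if quality(current) > quality(best):
            best = current
        temperature *= alpha  # geometric cooling schedule
    return best
```

Even a negative result against such a baseline would strengthen the paper's case for its simpler methods.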
Another concern is that somewhere in the evaluation it is said that the main interest is in large queries, which is used to justify generating a separate query log. But why would the interest be in the seemingly infrequent large queries rather than in the frequent short ones? What is the benefit of a system that does well on rare cases if its precision is not tested against the usual ones? The fact that this focus was not mentioned earlier, together with the lack of explanation, could give the impression that this was a post-hoc decision, motivated by the fact that finding good solutions for short queries seems easy and does not discriminate elaborate search methods from simpler ones.
The provided precision measure is calculated over the query log. Moreover, the search space is in principle the set of all possible queries, but it is then reduced to “what exists in the query log”. Since the motivation is to produce indexes that are useful for new queries, at least a brief consideration of the implications of these choices would be desirable. Would it make sense to split the log and test performance on genuinely new queries, which is what the use case would face?
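Concretely, a held-out split of the log would suffice to measure precision on genuinely unseen queries; a minimal sketch (function and parameter names are mine):

```python
import random

def split_log(query_log, train_fraction=0.8, seed=42):
    """Split the query log into a training part (used to search for
    index configurations) and a held-out part (used to evaluate
    precision on queries the search never saw)."""
    rng = random.Random(seed)
    shuffled = list(query_log)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```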
Again on precision: there is a quick remark in a section called “related implementations” mentioning an evaluation of the precision and cost estimations. This felt missing from the paper and, since the analysis exists, it should have been discussed in detail, especially since “the estimated cost is typically off by a factor of 2 compared to the true cost”. It is also unclear to me whether the evaluation is against the actual precision and cost, or against the estimation functions that are also used during the search.
Quality of the writing
----------------------
I believe that the quality of the writing should be significantly improved to meet the standards of a well-ranked journal publication. I would encourage the authors to ask collaborators for friendly reviews before submitting the manuscript, especially if English is not their first language.
Writing issues appear throughout the paper and include:
Grammar and spelling mistakes:
“Never will” instead of “will never” on pg. 2.
The subject of the sentence is missing: “execute the resulting query and check if it returns nothing”
“... the constructed query is large and requireS many joints to be answered”
“I.e.” should not be at the beginning of a sentence (e.g., on pg. 3).
“... a (un)productive query”, then “.. an (un)productive query” (pg. 4)
Several nested parentheses (pg. 5)
“fast enough NOT TO delay” (pg. 6)
“There are several ways to construct an index …, but our approach is based on”: “but” is used to introduce a phrase or clause contrasting with what has already been mentioned, so it is misused in this context.
“... can not” should be “cannot” (pg. 8)
“The root column will in _ always only contain _” is incorrect, it can probably be replaced by something like “In table _, the cells of the root column can only contain _” (pg. 8)
“... but also”: the sentence should start with “not only” (pg. 9)
…
Errors in the examples:
“Example: Productive Extensions”:
In Fig. 1, “borders” and “knows” should be bidirectional arrows (per the definition of N).
d_1 is of type Integer
In Fig. 4, the caption refers to nonexistent queries.
“Example: Index Construction”:
What is ansOpt?
If I am correct, a row for Alice 16 is missing.
In Fig. 7:
I believe that an edge “knows” to a vertex “Person” is missing in Z_person
I believe that an edge “borders” to a vertex “Country” is missing in Z_country
(I stopped verifying the contents of the figures in detail at this point.)
In addition to actual errors, most of the graphics from Fig. 9 to Fig. 13 do not show some of the data points, either because of the chosen representation or because the points fall outside the plotted range. Within reason, it would be good to find a way to display everything.
Ill-defined technical vocabulary usage
I would strongly encourage the authors to present section 2 more formally, including numbered definitions and an explicit characterisation of their framework. The fact that things are introduced loosely, within paragraphs, makes it time-consuming for the reader to figure out exactly what things are, and one has to reread full pages to find the definition of a reappearing symbol. There are small errors here and there, but these are hard to trace because of the lack of formal definitions. For instance, on page 3 it seems that the typing function T_q is intended to map vertices to types, yet it is then also used to map an edge to a type. In “Q_s is the largest subtree …” (pg. 7): can there not be several largest subtrees? Also, it seems implicit elsewhere that subtrees must preserve the same root. Since nothing is defined, it is hard to know whether this is an error or whether it was simply never made explicit. Overall, I believe that more explicit formal definitions would improve clarity and possibly save space.
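For instance, an explicit signature for the typing function would already resolve the vertex/edge ambiguity. One possible shape, where $V_q$, $E_q$, and $\Gamma$ are my guesses at the intended sets, not the paper's notation:

```latex
\begin{definition}[Typing function]
  Let $q$ be a query with vertex set $V_q$ and edge set $E_q$.
  A typing function is a map $T_q \colon V_q \cup E_q \to \Gamma$,
  assigning a type to every vertex \emph{and} every edge of $q$.
\end{definition}
```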
Another example is in section 5 (pg. 14) with “W_m = {Q_t | t \in N}”. N is not a set but a graph, so what is “t \in N”? Abusing notation, one could accept t to be a vertex, but vertices can be classes and datatypes, not just classes. And what is Q_t? This is never explained; one gets an intuition from the example but nothing more. Then we have “W_r is the set containing one small configuration query for each property” (in the figure). From the examples, I deduce that “small” means “of size 1”, but this should be specified. Later, the textual definition of “W_r” mentions “...contains the values from Gamma_v that exists as targets of p”. What is p?
I would suggest being careful with the usage of certain vocabulary. Some examples:
I would avoid the abuse of superlatives and loose notions that are vague in their interpretation, e.g., “... requires deep knowledge about the ontology”, “The index grows very fast”, “it takes a lot of time” (Introduction, page 2). What is “deep knowledge” of an ontology? How fast is “very fast”? Exponential?
“the only way to guarantee sufficient efficiency is to build and use an index structure that is specialized for the task” (pg. 7). This sounds like a very strong claim to me: first, because what counts as “sufficient efficiency” is vague, and second, because other researchers may come up with strategies other than “building a specialized index for the task”. I would refrain from this sort of assertion.
The problem description is said to be “finding THE optimal configuration”, but “optimality” is not defined in this context; it is only qualified as “high precision and low cost” (not even “highest/lowest”). Then it says: “there will for realistic setups never be only one single such optimal configuration”. Overall this is vague, which should be avoided, at least when describing the main aim of the paper. Alternatively, the author could define the problem from the beginning as “finding Pareto-optimal configurations” and briefly explain what that means.
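Concretely, a statement along the following lines would remove the ambiguity (my formulation; $\mathrm{cost}$ and $\mathrm{prec}$ stand in for the paper's cost and precision measures):

```latex
% A possible problem statement, not the paper's own:
A configuration $Z$ is \emph{Pareto-optimal} iff there is no
configuration $Z'$ with $\mathrm{cost}(Z') \leq \mathrm{cost}(Z)$ and
$\mathrm{prec}(Z') \geq \mathrm{prec}(Z)$, where at least one
inequality is strict. The task is then to approximate the set of
Pareto-optimal configurations.
```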
Questions:
Query extensions and productive extensions, page 4. While it is easy to check that a filterless extension conforms to N, it may still be unproductive, so it should still not be suggested as a productive extension. However, in this paragraph this is said to be trivial. Could you explain?
“The root column will in ans_e(Z,D) always only contain X”: What if there are no instances of the class at the root?
Isn’t Gamma_v unnecessary in equation (2.7)?
Long-term stable URL for resources:
-----------------------------------
In light of the concerns expressed in this review and my recommendation to reject the paper, I have not further analysed the resources provided by the authors.