Review Comment:
This paper presents the system SPARKLIS, which allows users to interactively build questions over a selected endpoint. The approach is very interesting and practical, as it combines auto-completion and faceted search and takes them a step forward. SPARKLIS's goal is to let users benefit from the expressivity of SPARQL in a user-friendly way, while providing guidance and an overview of a dataset, thereby avoiding the habitability problem typical of natural language interfaces (NLIs).
For example, one of the advantages of this approach over previous work is that the user can choose any position in the query (instance, class, property, or an operator) as the focus to obtain further suggestions and refine the query. The approach also covers SPARQL features such as union, negation, optional, filters, aggregations, and ordering to build more complex queries (with superlatives, aggregations, etc.; see the sketch below). The results are verbalised in natural language. To reduce the burden of query creation on the user, the system presents suggestions in the form of meaningful phrases.
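As a concrete illustration of this expressivity (my own sketch of the kind of SPARQL such a verbalised question maps to, using assumed DBpedia vocabulary, not an example from the paper), a superlative question like "Which mountain in France has the highest elevation?" combines class and property constraints with ordering and a limit:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    # Superlative: order mountains located in France by elevation, keep the top one
    SELECT ?mountain ?elevation
    WHERE {
      ?mountain a dbo:Mountain ;
                dbo:locatedInArea dbr:France ;
                dbo:elevation ?elevation .
    }
    ORDER BY DESC(?elevation)
    LIMIT 1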
The author points out that the approach covers SELECT and ASK queries, and that it doesn't cover nested queries and aggregations -> could you add an example here (something along the lines of the sketch below) and perhaps elaborate on why?
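For instance (an illustrative sketch of my own, with assumed DBpedia vocabulary, not taken from the paper), a question like "What is the average number of cities per country?" needs an aggregation nested inside another aggregation:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # Outer aggregation (AVG) over the result of an inner aggregation (COUNT) in a subquery
    SELECT (AVG(?cities) AS ?avgCitiesPerCountry)
    WHERE {
      {
        SELECT ?country (COUNT(?city) AS ?cities)
        WHERE { ?city dbo:country ?country . }
        GROUP BY ?country
      }
    }

Clarifying whether it is the subquery structure or the composition of aggregates that is out of scope would help the reader.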
Moreover, as the author points out, only a limited number of suggestions can be presented to the user for the approach to scale. I was missing a discussion in the paper of how these suggestions are selected, i.e., how do you rank which results to present to the user?
This paper presents a new evaluation based on user logs collected through the online demo. However, the system has already been evaluated in previous publications using the QALD corpus and a usability study. While you don't need to present all those results again here, I was missing a discussion of the limitations of SPARKLIS based on the current and previous evaluations. For instance, how often were queries unreachable because of partial results? How did scalability affect usability? How do users distinguish between out-of-scope queries and queries that cannot be built because of a very large number of suggestions or because they are not using the right word (synonyms)? This is one of the main issues with faceted and guided interfaces, and it isn't clear how the author tackles it for large, open-domain endpoints such as DBpedia.
While I agree with the author that a query-building interface avoids the need for complex NL understanding while reaching comparable expressivity, and that guidance avoids the ambiguity problem for small datasets, I am not convinced it totally avoids the problem of disambiguation, as the user still needs to figure out how the knowledge and vocabulary are modelled in large data sources. The author states that suggestions could also be based on learning or user preferences in future work, but is there any intelligent ranking currently in use? Is it limited to DBpedia categories, or do you also include the YAGO hierarchy? What about lexically related words?
The evaluation presented in this paper, based on user logs, shows some statistics on usage, a list of endpoints users selected, and a nice distribution of query sizes. For instance, in the example given, "show me a drug that has a title...", it may not have been obvious to the user that they had to look for the relation "has a title". The paper states that untrained users first need to learn by trial and error to finally reach their intended query (an example is given for a query of size 6: 14 steps and less than 5 minutes). Beyond just giving an example, could you measure this? It would be a nice contribution of this paper (with respect to previous ones) to measure average times and numbers of failed attempts according to query size, or the learning curve (if not for all endpoints, at least for some of the latest logs, e.g., for DBpedia). An extended evaluation and discussion of real user logs (in comparison to benchmarks) would be very interesting. Also, it would be nice if the questions users asked of the different endpoints and used in this evaluation could be published somewhere online.
Moreover, I would have liked to see an extended discussion of the coverage of the types of queries this system could tackle (now or in the future, following a similar approach). For example, what about more complex spatial or temporal queries? Or analytical or statistical queries that may require additional processing, such as (just to come up with examples) "What is the most common cause of mortality across … ?", "How many movies in xxx were released in xx year?", or "What is the average … ?". Some examples extending the discussion (such as the sketch below) would improve the understanding of the capabilities, limitations, and potential of this kind of system.
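To make the first of these concrete (my own sketch, with illustrative DBpedia properties that are an assumption on my part), such an analytical question roughly corresponds to a grouped count with a superlative:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # Group deaths by cause, count them, and keep the most frequent cause
    SELECT ?cause (COUNT(?person) AS ?deaths)
    WHERE { ?person dbo:deathCause ?cause . }
    GROUP BY ?cause
    ORDER BY DESC(?deaths)
    LIMIT 1

Discussing whether such grouped and ordered aggregates are reachable through the current suggestion mechanism, and where the approach would break down, would strengthen the paper.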
To sum up, the paper is nicely written, the topic it tackles is very relevant, and convincing evidence is provided, but it could improve in conveying the limitations and challenges (e.g., ranking of suggestions, usability vs. scalability) as well as in the discussion and analysis of the evaluation results, which is why I ask for major revisions (although they are not that major, just an extension).