Review Comment:
I am reviewing the revised version of this paper, which already includes revisions in response to the first round of reviews. (I was not one of the first-round reviewers.) I have seen the authors' remarks in response to the first-round reviews, and the highlighted changes made by the authors. Generally speaking, it is clear that the authors have made a big effort to address all of the concerns expressed in the first-round reviews, and have done a good job of it. (The additional content on the evaluation, in particular, represents a big effort.)
The paper presents a method of representing user preferences, expressed in accordance with the CP-theory formalism, in RDF/OWL. The method then automatically generates a SPARQL query that uses the preferences representation to determine an ordering of possible outcomes of user queries, and inserts that ordering into a knowledge base. Subsequently, when a user query is received (which expresses the user's requirements, or hard constraints), the ordering information is used to rank the results that satisfy the user' s requirements. Additionally, the paper describes some experiments that evaluate the effectiveness of a tool for acquiring user preferences, and also experiments that said some light on the performance of the proposed method.
Overall, the paper is interesting and clever, and reasonably well-written. Adding preferences to SPARQL is definitely a worthwhile goal. This approach is distinctive in supporting CP-theories, and also distinctive in that no SPARQL extensions or special conventions for query formulation are needed. So it supports use cases that aren't supported by the previous work on SPARQL with preferences.
Although I am doubtful that this approach will ever serve as a blueprint for a practically-useful deployed system, nevertheless I agree with the authors' claim that this work provides "a good starting point to reason with preferences in a Linked Data environment as it is a straight implementation of theoretical results coming form previous works". In other words, I think this work provides some directions and building blocks which could provide support for further developments leading to a practically-useful deployed system. And even if it does not directly do so, I think the work is of sufficient theoretical / modeling interest to warrant publication. However, I think a few additional clarifications would improve the paper, and some additional work on readability.
I believe the paper should be more clear about limitations regarding domain size. Unless I'm missing something, it seems clear that the domain of each variable used in a CP-theory must be finite in order to be used in this approach (because the approach derives and asserts a score for all possible outcomes), and also should be reasonably-sized for performance reasons. This is a strong limitation, given that preferences can easily involve large domains or comparisons over infinite domains such as distance and time (for example, I might want to inform a map-search system that all things being equal, I prefer restaurant X, whose distance from my location is less than the distance to restaurant Y; or in searching an information source for temporal events, I might want to say that I prefer an earlier event to a later one). Given the fundamental nature of this limitation, it should be mentioned and, if there are ideas about addressing it, they should be mentioned as well.
Again, the approach involves deriving "the strict partial order over *all possible outcomes* that is determined by the CP-theory", which has crucial implications for feasibility and performance. I'm wondering – in future work, would it be possible to modify the approach so that the strict partial order is only derived as needed for each given query – and only for the items meeting the user requirements of the query?
I am uneasy about the following claim, which appears in the related work section: "[PrefSPARQL deals] with “simple” conditional preferences while our approach is able to manage CP- theories which result to be much more expressive even in their cuc-acyclic version we consider here". Although I cannot shed much light on the relative expressiveness of the 2 approaches, and I'm not an expert on CP-theories, I am concerned about this claim because (1) the paper doesn't explain what is meant by "simple conditional preferences", or how that differs from what is expressible in CP-theories; (2) the conditions and preferential expressions (x is strictly preferred to x') of CP-statements themselves are constrained to be assignments of particular values to variables, whereas PrefSPARQL allows a broader range of expressions, including, for example, comparison operators; and (3) the approach, as mentioned above, can only refer to variables having finite domains, which is not true of expressions in PrefSPARQL. I believe some justification is needed for the claim that CP-theories are much more expressive (or, if no justification is given, the wording of the sentence quoted above should be revised).
One of the chief differentiating claims of the approach – that it can "provide an order of all the available outcomes that reflects user preferences", and not just the skyline of non-dominated outcomes – is weakened somewhat by the following observations: (1) Siberski et al. (which you cite) proposed that a use of SPARQL's existing construct "LIMIT k", in combination with their proposed keyword PREFERRING, can inform the query evaluator to retrieve enough "levels of skyline" (which can be defined very naturally) to produce solutions numbering at least k; and (2) the prior SPARQL-specific work is oriented around expressing preferences, and retrieving undominated results, from a single query. If one considers using 2 or more queries executed in sequence, it would not be hard to extend the prior approaches to return additional results, going beyond the top skyline, and to rank them according to their skyline level. Please consider mentioning one or both of these observations, if it would be clarifying or helpful to the reader.
The ordering of outcomes by "the number of outcomes that they are able to... dominate", although easy to understand, probably isn't ideal from a practical perspective, in my opinion. It seems to me, intuitively, that it is more useful to rank an outcome by how close it is to being undominated (i.e., by what "level of skyline" does it belong to). In other words, outcomes should be ranked by increasing length of the longest path through more preferred outcomes. I'm not requesting any particular modifications to the paper here; just offering these observations for the authors' consideration.
I am not thrilled with the title; I think its wording is unnecessarily inaccurate. Unless I am missing something, the approach does not "embed" CP-theories as SPARQL queries. It represents CP-theories in RDF, and subsequently composes queries over those RDF statements. There is no point in the process in which a CP-theory is expressed by a SPARQL query. Perhaps a better title would be something like "Using RDF and SPARQL with CP-theories to reason about preferences".
---------
I realize the authors have already made significant improvements for clarity, but I recommend another careful proofreading for readability, aiming to be as helpful as possible to the average reader of the Journal, and keeping in mind that it's a complex and fairly densely written paper. Here are several specific observations about readability:
The "motivating scenario" section - Section 2, including Figure 1, is potentially confusing and non-helpful, because it doesn't describe or clearly illustrate the input and handling of the user's query! In particular, in Figure 1, the input of the user's query (expressing their "user requirements") is not illustrated. Also, the final step under the "Interact" heading vaguely refers to "a SPARQL 1.1 query able to retrieve and order resources by taking into account user preferences". Is not at all clear whether that refers to the query of step 1 in Section 4.2, or to the query of Step 2. Moreover, it's misleading to talk about "*a* SPARQL 1.1 query able to retrieve and order resources by taking into account user preferences", as if a single query does everything. I believe that Section 2 needs to be more accurately matched to the points of Section 4.2, and needs to do a better job of preparing the user for the various distinctions that are presented in Section 4.2. It's understandable if you want to keep section 2 brief and high level, but you could at least refine it to indicate to the user that some important additional steps/details will be described in the later section.
In section 4, just before Example 10, the "federated query, composed by 2 sub queries", is described in terms of what the subqueries do. I think it would be helpful to add a brief statement (even if it is only a single sentence or two) about how the federated query is generated, and/or what the generation is based upon.
In section 3, why are so many mathematical expressions inserted in the standard paragraph formatting, with no attempt to improve readability? Is this because of page constraints imposed by the journal? If so, well, okay, but if not – please think of the reader and make some formatting improvements for readability. At a minimum, I strongly recommend to break out the definition of CP-theory, and the definition of worsening swap. Also, it would be good to consider some formatting improvements in the final lengthy paragraph of Section 3.
Please note that there is a version of "Preference Queries with Ceteris Paribus Semantics for Linked Data" on semanticscholar.com which appears to be substantially different than your paper of the same title that was published in the OTM proceedings – even though the semantic scholar.com page states that the paper there was "Published 2015 in OTM Conferences". Judging from the abstracts, the difference between the 2 versions is very significant; i.e., the former deals with CP-nets whereas the latter deals with CP-theories. (Indeed, It appears that the version on semantic scholar is pretty close to the initial version that was submitted to SWJ.) It would be good to find some way to clear up this confusion as it could lead to wasted time of other researchers.
-----
A couple more specific suggestions for readability. These are both sentences that I stumbled over, which can easily be improved.
"However, both the approaches proposed in [40] and [27] have an important limitation: they are not able to provide an order of all the available outcomes reflecting user preferences and return only the undominated (a.k.a. Pareto-optimal)results,i.e., those outcomes best satisfying user preferences."
In the above, clarity can be improved as follows:
However, both the approaches proposed in [40] and [27] have an important limitation: they are not able to provide an order of *all* the available outcomes that reflects user preferences. That is, they return only the undominated (a.k.a. Pareto-optimal) results, i.e., those outcomes best satisfying user preferences.
---
"The query computes the set Θ0(α,β) by considering the variables X in the set ∆(α,β) for which it does not exist ..."
Here, I recommend simply changing "it does" to "there do":
The query computes the set Θ0(α,β) by considering the variables X in the set ∆(α,β) for which there do not exist ...
I hope these comments are helpful. I will be interested in any response-comments that you may have.
|