Review Comment:
This manuscript proposes an approach to answer specific types of SPARQL queries using a differential privacy guarantee.
The problem that this work aims to address is highly important and, to the best of my knowledge, it is the first work in this direction. The text is generally okay to understand but there are many small inaccuracies and formal mistakes (details below), which leaves a bad first impression. While these mistakes may be easy to fix, the presented work has two major weaknesses that require a major revision of the manuscript.
W.1: The first major weakness is that the evaluation is designed in a way that makes it unsuitable to evaluate the presented approach (see point W.1.1 below). Additionally, important information about the evaluation are missing and there are inaccuracies in the description of the evaluation (see point W.1.2 below).
W.2: The second major weakness is that the core of the proposed approach is a technical result (Theorem 3 together with Lemma 4) that is not proven sufficiently. An initial problem here is that the formal definition of the proposed approach itself is full of inaccuracies and small mistakes (see point W.2.1 below). On top of this problem, the presented proofs of Lemma 4 and of Theorem 3 have similar inaccuracies, contain mistakes (!), and lack important details (see points W.2.2 and W.2.3 below).
Besides these major weaknesses, I am not entirely convinced of the notion of a Differential Privacy schema and, in particular, of its usage to define the notion of distance of RDF graphs. The manuscript should elaborate more on the intuitions behind these notions. For instance, if we consider Example 3.1 but assume that the triple (Batman,enemy,Joker) is not in the graph---but (Joker,enemy,Batman) is---then the graph does not comply anymore with the schema P in the example. Now, we may use the same trick as described below Example 3.1 and break up the second BGP of P into two BGPs: {(?s2,lives,?o2)} and {(?s3,enemy,?o3)}. While changing the schema in this way fixes the problem that the graph does not comply with the schema, the issue now is that the data that entity 'Joker' contributes to the dataset is not in a single subgraph anymore (when partitioning the graph based on the changed schema), which is in conflict with the desired property described on page 5 (LHS, lines 24-26). In this sense, it is unclear to me how useful the notion of a Differential Privacy schema really is in practice. Moreover, I also wonder what is the complexity of checking, for every schema P and every graph G, whether G complies with P? Similarly, considering that data may change, what is the complexity of checking whether an update operation over a graph would result in a violation of the compliance? The manuscript should elaborate more on such questions.
In the remainder of this review I first discuss the two major weaknesses in detail and, thereafter, point out additionally minor issues.
W.1.1 --- Unsuitable evaluation design:
* The main issue with the way the authors have set up their evaluation is that they have created a different "Differential Privacy schema" for each query considered in the evaluation (see page 11, LHS, pages 10-26). Using such query-dependent Differential Privacy schemas to evaluate the proposed approach contradicts the approach itself because the approach is defined based on the assumption that there is a single Differential Privacy schema for the given dataset (RDF graph); this schema is independent of any particular query. By suddenly using a different Differential Privacy schema per query and, even more, by creating these schemas based on the queries themselves, the presented evaluation is not actually evaluating the proposed approach. Consequently, based on the presented evaluation, we do not learn anything about the proposed approach.
* Another mistake in the evaluation design is that it is not clear whether (or how) the authors have checked that the test datasets are compliant to the Differential Privacy schemas that have been used. The proposed approach assumes that the queried RDF graphs are compliant to the schema, and such a compliance is actually required for the approach to be correct in terms of its differential privacy guarantees. Hence, without checking compliance, the evaluation may have used the approach in a way in which it is not intended to be used, namely, without any correctness guarantees.
* A third mistake in the evaluation design is that, for calculating the elastic sensitivities of the test queries, the authors have "calculate[d] the size P(G)" (see page 11, LHS, pages 26-29). That's incorrect! By Lemma 1, instead of using the size of P(G), it must be the size of G (called size(D) in the lemma).
W.1.2 --- Important information about the evaluation are missing:
* The manuscript does not describe what the authors have actually done in the evaluation. More precisely, there is no description of what exactly has been done with the queries.
* What exactly do the columns in Tables 1 and 2 mean? In particular:
1) What is meant by "% error" and how have these values be obtained or calculated?
2) What is "Graph Size"? What graphs?
3) Is "Sensitivity" meant to represent the global sensitivity or the local sensitivity of the queries? Moreover, is this the actual value or an estimate? How exactly have these values been determined?
4) What is x in the "Elastic Stability" column?
* Speaking of the symbol x, the authors make the following claim on page 11 (LHS, lines 41-43): "As discussed before, the variable x in the expression must be chosen to maximize S^{(k)}(, )." That's not true. I don't think there is any discussion of a "variable x" in the manuscript.
* Regarding Table 1, on page 11 (LHS, lines 37-41), the authors write: "Queries showing a polynomial in the Sensitivity column are evaluated using Elastic Sensitivity, while the non polynomial queries are evaluated using Restricted Sensitivity by applying a fixed value for the most popular join value within the star." There are several issues with this sentence:
1) The Sensitivity column does not contain any polynomial but only numerical constants.
2) It is not clear what the authors mean by "Restricted Sensitivity." This term appears here for the first time in the manuscript.
3) Using "Restricted Sensitivity" for some queries seems to be another example of unsuitable evaluation design because the proposed approach is meant to use "Elastic Sensitivity" instead.
4) Even if I ignore the previous three issues, I end up wondering what "star" this sentence refers to, what exactly "the [...] join value within the star" is, and what "fixed value" has been used.
* Which values have been used for \epsilon and \delta?
W.2.1 --- Inaccuracies and mistakes in the formal definition of the proposed approach:
* The informal definition of the second type of "user query" (page 6, RHS, lines 34-39) is much too vague. I am expecting a formal definition. It may be helpful to introduce the GROUP BY operator explicitly!
* The definition of the approach (in particular, on page 7, LHS, lines 1-3) does not take into account that triple patterns may have a variable in the predicate position.
* The definition of the approach (in particular, on page 7, LHS, lines 34-41) does not take into account that a BGP may consist of multiple separate sub-BGPs that do not share any variable.
* The definition of the query that defines the value denoted by mp(?X,\bar{B},G) is wrong for two reasons: First, the query is independent of \bar{B}; I guess that the WHERE clause of the query is actually meant to be the BGP of \bar{B}. Second, the query should contain a GROUP BY clause. In its current form, the query looks more like an attempt to count the overall number of triples in the queried graph. In fact, without grouping by ?v0, the query is even invalid syntactically, because it is illegal to use ?v0 in the ORDER BY clause. It is also not clear why the query result should be sorted by ?v0. Moreover, the word "popular" in the description is unsuitable.
* The recursion step in the definition of mp_k(?X,\bar{B},G) (page 8, LHS, lines 9-14) is nondeterministic, which makes the whole definition formally insufficient. The problematic part is to say "Let B_1 ∈ B_{÷P}." First of all, it is not clearly specified which of the elements of B_{÷P} this refers to. By having in mind the symbols used before, the reader may guess that this is meant to refer to the first BGP in B_{÷P} according to a "normal ordering" introduced before (on page 7, LHS, lines 34-41). The trouble however is that, for many BGPs, there is not just one such "normal ordering" but there may be multiple. Consequently, it is still unclear which BGP of B_{÷P} is meant by B_1.
* The recursion step discussed in the previous point has further issues:
1) Using the operator "arg max" in the formula is wrong because the result of this operator would be a variable ?u rather than the maximum numeric value. Instead of "arg max", the formula should use the operator "max".
2) The formula uses the function mp_k recursively in a way as if the second argument of the function was a set of BGPs, which is actually not the case (the second argument is defined to be a CBGP).
* Issue 1) can also be found in the formulas in lines 17-20.
* Issue 2) can also be found in the formula in lines 34-35 where B' is not a CBGP (or, at least, a BGP) but it is a set of BGPs. Related issues in that same formula are that the function var is not defined for B', and neither is the function S^{(k)}_{RDF}.
* In the definition of the function S^{(k)}_{RDF} (page 8, LHS, lines 22-40) the "Differential Privacy schema P" should be another argument to his function. If not, then it is not clear which "covering star S" this definition refers to. I mean, it must be stated explicitly that S is in P.
* About the same definition, it seems weird to me that the base case of this recursive function is completely independent of the value k. What is the rationale of that?
* Speaking of rationale, in Sec.4.1, all the notions and their formulas are introduced without explanations. I am expecting a brief description of each notion (what is the intuition of the notion?) and a running example that applies these notions.
* The definition of the "induced" graphs (page 5, RHS, around lines 25-28) should also be made more accurate: Instead of saying that the (RDF) graph induced
by a star BGP B from G is "the set of RDF triples used to obtained the solution mappings [...]," it is better to say that it is "the smallest set of RDF triples in G that are necessary to obtain the solution mappings [...]."
* It is not clear whether the "sensitivity" of triple patterns and of star patterns (page 6, LHS, lines 29-46) is independent of any specific RDF graph G. It should be, but the text is not sufficiently explicit about it. In fact, the whole definition of this notion is a bit fuzzy and should be made more formal.
W.2.2 --- Problems in the proof of Lemma 4:
* First of all, the lemma must also be explicit about the value k. Does it range over all natural numbers?
* Moreover, in the formula of the lemma, I guess it should be COUNT(\bar{B}) rather than only COUNT(B).
* The base case of the induction is not shown sufficiently. First, it is inaccurate to say that "sen(S) = \kappa" because, by the definition on page 6 (LHS, lines 29-46), Sen(S) is "the multiplication of the sensitivities of the triple patterns in S" and each triple pattern may have a different \kappa. More importantly, however, it is not made clear why and how this value "sen(S) = \kappa" is greater or equal to LS^{(k)}_{COUNT(B)}(G) in all possible cases (i.e., for any B, any G, and any k). I am expecting a formal proof that refers both to Def.2 and to the authors' definition of distance between two RDF graphs.
* The induction step of the induction is even more problematic.
1) The main issue is the claim that "either S^{(k)}_{RDF}(B_1,G) = 0 or S^{(k)}_{RDF}(B',G) = 0." By the definition of function S^{(k)}_{RDF}, neither of these can be zero! For instance, for B_1, the definition says that S^{(k)}_{RDF}(B_1,G) = Sen(S_1), and Sen(S_1) must be greater than 0.
2) Moreover, as for the base case, the discussion of the induction step lacks an explicit explanation how things are related to the local sensitivity of COUNT(B) at distance k; i.e., why the formulas are guaranteed to give us values that are greater or equal to LS^{(k)}_{COUNT(B)}(G) in all possible cases. In other words, also here I am expecting references both to Def.2 and to the authors' definition of distance between two RDF graphs.
3) Besides these major issues, the discussion of the induction step uses a number of symbols that are not clearly introduced in the context of the proof (what is S in line 23? what is G_{S_1} and G_{S'} in line 26? Is B_i in line 29 meant to range over all BGPs in B' or is it just one of these BGPs?)
W.2.3 --- Problems in the proof of Theorem 3:
* The case of histogram queries is described only in a brief bullet point that is too short for me to be able make any sense of. Additionally, some parts of the given description cannot be understood. For instance, what is meant by "each changed triple"? (in fact, based on the authors' notion of distance between two RDF graphs, I would assume that this should be 'changed subgraphs' instead) How does a single changed triple (or subgraph) lead to exactly one solution mapping moving from one group to another, can't it be more than one solution mapping? What does the last sentence mean (it is not even a proper sentence in the English language)? Where is the value k in this discussion? How are these things related to the local sensitivity of the queries (Def.2)?
W.3 --- Minor issues:
* The Background section is missing bibliographic references: There must be references for Definitions 1-3, for Theorem 1, for the ideas and concepts in the last paragraph of page 3, and for Lemma 1.
* page 3, LHS, line 15: You should clarify that all the tuples in such a multiset are meant to have the same arity (same number of elements).
* page 3, LHS, line 17: What exactly is "the value" that you are referring to here?
* page 3, LHS, lines 16-25: The definition of the notions of "k-far apart" and "neighbours" seems to assume that the datasets are of the same size in terms of number of tuples. This assumption should be made explicit. Are these notions undefined for any pair D and D' with |D| != |D'|??
* same issue in the context of multi-table datasets (lines 42-51).
* page 3, LHS, line 45: What types of sets are T_1 ... T_n? Sets of tuples?
* page 3, LHS, line 45: What does "tagged" mean?
* page 3, LHS, line 45: What is meant by "data points" and what does "belonging to T_1,...,T_n" mean?
* page 5, LHS, line 6: It should be "query sensitivity" (instead of "query sensibility").
* page 5, RHS, lines 15-16: I don't see anything like that in Figures 3 and 4.
* page 6, LHS, lines 42-46: Instead of "sensitivity," I suggest to use a different name for this concept. Using the term "sensitivity" may be confusing because there already exists a different concept with this name in the context of differential privacy (as you have introduced before).
* page 6, RHS, lines 1-2: It should be "solution mapping" (instead of "solution assignment"). In general, you should be more consistent in terms of naming these things. Besides wrongly naming it "solution assignment" here, I have seen just "mapping" but also "result mapping" in other parts of the manuscript. Such inconsistency is confusing or even misleading.
* page 6, RHS, lines 29-39: I suggest to move this part to the of this section. At the current place it feels misplaced (in particular, because everything that comes below this part in the section is only about the things introduced before this part.
* page 7: There is a mistake in Example 3.3. The first triple pattern should be removed from B_2!
* page 7, LHS, line 39: "national convenience"???
* page 8, LHS, line 50: It should be "RDF triples" (instead of "RDF tuples").
* page 8, RHS, line 18: The word "sensibility" is wrong here!
* page 9: In the formula in Theorem 3, it should be S^{(k)}_{RDF}(Q,G) and not S^{(k)}(Q,G).
* page 9, RHS, line 11: There needs to be a bibliographic reference for Watdiv (in the same way as there is a reference for Wikidata a few lines before).
* page 9, RHS, line 17: What URI does this sentence refer to?
|