Differential Privacy and SPARQL

Tracking #: 2610-3824

Authors: 
Carlos Buil Aranda
Jorge Lobo
Federico Olmedo

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
Abstract: 
Differential privacy is a framework that provides formal tools to develop algorithms to access databases and answer numerical and statistical queries with quantifiable accuracy and privacy guarantees. The notions of differential privacy are defined independent of the data model and the query language. Most results have been on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. The data model has been typically the relational model and the query language SQL. However, good realizations of deferential privacy for queries that required joins had been limited. This has imposed severe restrictions on applying differential privacy in RDF knowledge graphs and SPARQL. By the simple nature of RDF data, most interesting queries accessing RDF graphs will require intensive use of joins. Recently though, new techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new definitions can be transferred to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer count queries over a large class of SPARQL queries that guarantees differential privacy, if the RDF graph is accompanied with some natural semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of differential privacy for SPARQL and RDF.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 01/Jan/2021
Suggestion:
Major Revision
Review Comment:

This manuscript proposes an approach to answer specific types of SPARQL queries using a differential privacy guarantee.

The problem that this work aims to address is highly important and, to the best of my knowledge, it is the first work in this direction. The text is generally okay to understand but there are many small inaccuracies and formal mistakes (details below), which leaves a bad first impression. While these mistakes may be easy to fix, the presented work has two major weaknesses that require a major revision of the manuscript.

W.1: The first major weakness is that the evaluation is designed in a way that makes it unsuitable to evaluate the presented approach (see point W.1.1 below). Additionally, important information about the evaluation are missing and there are inaccuracies in the description of the evaluation (see point W.1.2 below).

W.2: The second major weakness is that the core of the proposed approach is a technical result (Theorem 3 together with Lemma 4) that is not proven sufficiently. An initial problem here is that the formal definition of the proposed approach itself is full of inaccuracies and small mistakes (see point W.2.1 below). On top of this problem, the presented proofs of Lemma 4 and of Theorem 3 have similar inaccuracies, contain mistakes (!), and lack important details (see points W.2.2 and W.2.3 below).

Besides these major weaknesses, I am not entirely convinced of the notion of a Differential Privacy schema and, in particular, of its usage to define the notion of distance of RDF graphs. The manuscript should elaborate more on the intuitions behind these notions. For instance, if we consider Example 3.1 but assume that the triple (Batman,enemy,Joker) is not in the graph---but (Joker,enemy,Batman) is---then the graph does not comply anymore with the schema P in the example. Now, we may use the same trick as described below Example 3.1 and break up the second BGP of P into two BGPs: {(?s2,lives,?o2)} and {(?s3,enemy,?o3)}. While changing the schema in this way fixes the problem that the graph does not comply with the schema, the issue now is that the data that entity 'Joker' contributes to the dataset is not in a single subgraph anymore (when partitioning the graph based on the changed schema), which is in conflict with the desired property described on page 5 (LHS, lines 24-26). In this sense, it is unclear to me how useful the notion of a Differential Privacy schema really is in practice. Moreover, I also wonder what is the complexity of checking, for every schema P and every graph G, whether G complies with P? Similarly, considering that data may change, what is the complexity of checking whether an update operation over a graph would result in a violation of the compliance? The manuscript should elaborate more on such questions.

In the remainder of this review I first discuss the two major weaknesses in detail and, thereafter, point out additionally minor issues.

W.1.1 --- Unsuitable evaluation design:

* The main issue with the way the authors have set up their evaluation is that they have created a different "Differential Privacy schema" for each query considered in the evaluation (see page 11, LHS, pages 10-26). Using such query-dependent Differential Privacy schemas to evaluate the proposed approach contradicts the approach itself because the approach is defined based on the assumption that there is a single Differential Privacy schema for the given dataset (RDF graph); this schema is independent of any particular query. By suddenly using a different Differential Privacy schema per query and, even more, by creating these schemas based on the queries themselves, the presented evaluation is not actually evaluating the proposed approach. Consequently, based on the presented evaluation, we do not learn anything about the proposed approach.

* Another mistake in the evaluation design is that it is not clear whether (or how) the authors have checked that the test datasets are compliant to the Differential Privacy schemas that have been used. The proposed approach assumes that the queried RDF graphs are compliant to the schema, and such a compliance is actually required for the approach to be correct in terms of its differential privacy guarantees. Hence, without checking compliance, the evaluation may have used the approach in a way in which it is not intended to be used, namely, without any correctness guarantees.

* A third mistake in the evaluation design is that, for calculating the elastic sensitivities of the test queries, the authors have "calculate[d] the size P(G)" (see page 11, LHS, pages 26-29). That's incorrect! By Lemma 1, instead of using the size of P(G), it must be the size of G (called size(D) in the lemma).

W.1.2 --- Important information about the evaluation are missing:

* The manuscript does not describe what the authors have actually done in the evaluation. More precisely, there is no description of what exactly has been done with the queries.

* What exactly do the columns in Tables 1 and 2 mean? In particular:
1) What is meant by "% error" and how have these values be obtained or calculated?
2) What is "Graph Size"? What graphs?
3) Is "Sensitivity" meant to represent the global sensitivity or the local sensitivity of the queries? Moreover, is this the actual value or an estimate? How exactly have these values been determined?
4) What is x in the "Elastic Stability" column?

* Speaking of the symbol x, the authors make the following claim on page 11 (LHS, lines 41-43): "As discussed before, the variable x in the expression must be chosen to maximize S^{(k)}(, )." That's not true. I don't think there is any discussion of a "variable x" in the manuscript.

* Regarding Table 1, on page 11 (LHS, lines 37-41), the authors write: "Queries showing a polynomial in the Sensitivity column are evaluated using Elastic Sensitivity, while the non polynomial queries are evaluated using Restricted Sensitivity by applying a fixed value for the most popular join value within the star." There are several issues with this sentence:
1) The Sensitivity column does not contain any polynomial but only numerical constants.
2) It is not clear what the authors mean by "Restricted Sensitivity." This term appears here for the first time in the manuscript.
3) Using "Restricted Sensitivity" for some queries seems to be another example of unsuitable evaluation design because the proposed approach is meant to use "Elastic Sensitivity" instead.
4) Even if I ignore the previous three issues, I end up wondering what "star" this sentence refers to, what exactly "the [...] join value within the star" is, and what "fixed value" has been used.

* Which values have been used for \epsilon and \delta?

W.2.1 --- Inaccuracies and mistakes in the formal definition of the proposed approach:

* The informal definition of the second type of "user query" (page 6, RHS, lines 34-39) is much too vague. I am expecting a formal definition. It may be helpful to introduce the GROUP BY operator explicitly!

* The definition of the approach (in particular, on page 7, LHS, lines 1-3) does not take into account that triple patterns may have a variable in the predicate position.

* The definition of the approach (in particular, on page 7, LHS, lines 34-41) does not take into account that a BGP may consist of multiple separate sub-BGPs that do not share any variable.

* The definition of the query that defines the value denoted by mp(?X,\bar{B},G) is wrong for two reasons: First, the query is independent of \bar{B}; I guess that the WHERE clause of the query is actually meant to be the BGP of \bar{B}. Second, the query should contain a GROUP BY clause. In its current form, the query looks more like an attempt to count the overall number of triples in the queried graph. In fact, without grouping by ?v0, the query is even invalid syntactically, because it is illegal to use ?v0 in the ORDER BY clause. It is also not clear why the query result should be sorted by ?v0. Moreover, the word "popular" in the description is unsuitable.

* The recursion step in the definition of mp_k(?X,\bar{B},G) (page 8, LHS, lines 9-14) is nondeterministic, which makes the whole definition formally insufficient. The problematic part is to say "Let B_1 ∈ B_{÷P}." First of all, it is not clearly specified which of the elements of B_{÷P} this refers to. By having in mind the symbols used before, the reader may guess that this is meant to refer to the first BGP in B_{÷P} according to a "normal ordering" introduced before (on page 7, LHS, lines 34-41). The trouble however is that, for many BGPs, there is not just one such "normal ordering" but there may be multiple. Consequently, it is still unclear which BGP of B_{÷P} is meant by B_1.

* The recursion step discussed in the previous point has further issues:
1) Using the operator "arg max" in the formula is wrong because the result of this operator would be a variable ?u rather than the maximum numeric value. Instead of "arg max", the formula should use the operator "max".
2) The formula uses the function mp_k recursively in a way as if the second argument of the function was a set of BGPs, which is actually not the case (the second argument is defined to be a CBGP).

* Issue 1) can also be found in the formulas in lines 17-20.

* Issue 2) can also be found in the formula in lines 34-35 where B' is not a CBGP (or, at least, a BGP) but it is a set of BGPs. Related issues in that same formula are that the function var is not defined for B', and neither is the function S^{(k)}_{RDF}.

* In the definition of the function S^{(k)}_{RDF} (page 8, LHS, lines 22-40) the "Differential Privacy schema P" should be another argument to his function. If not, then it is not clear which "covering star S" this definition refers to. I mean, it must be stated explicitly that S is in P.

* About the same definition, it seems weird to me that the base case of this recursive function is completely independent of the value k. What is the rationale of that?

* Speaking of rationale, in Sec.4.1, all the notions and their formulas are introduced without explanations. I am expecting a brief description of each notion (what is the intuition of the notion?) and a running example that applies these notions.

* The definition of the "induced" graphs (page 5, RHS, around lines 25-28) should also be made more accurate: Instead of saying that the (RDF) graph induced
by a star BGP B from G is "the set of RDF triples used to obtained the solution mappings [...]," it is better to say that it is "the smallest set of RDF triples in G that are necessary to obtain the solution mappings [...]."

* It is not clear whether the "sensitivity" of triple patterns and of star patterns (page 6, LHS, lines 29-46) is independent of any specific RDF graph G. It should be, but the text is not sufficiently explicit about it. In fact, the whole definition of this notion is a bit fuzzy and should be made more formal.

W.2.2 --- Problems in the proof of Lemma 4:

* First of all, the lemma must also be explicit about the value k. Does it range over all natural numbers?

* Moreover, in the formula of the lemma, I guess it should be COUNT(\bar{B}) rather than only COUNT(B).

* The base case of the induction is not shown sufficiently. First, it is inaccurate to say that "sen(S) = \kappa" because, by the definition on page 6 (LHS, lines 29-46), Sen(S) is "the multiplication of the sensitivities of the triple patterns in S" and each triple pattern may have a different \kappa. More importantly, however, it is not made clear why and how this value "sen(S) = \kappa" is greater or equal to LS^{(k)}_{COUNT(B)}(G) in all possible cases (i.e., for any B, any G, and any k). I am expecting a formal proof that refers both to Def.2 and to the authors' definition of distance between two RDF graphs.

* The induction step of the induction is even more problematic.
1) The main issue is the claim that "either S^{(k)}_{RDF}(B_1,G) = 0 or S^{(k)}_{RDF}(B',G) = 0." By the definition of function S^{(k)}_{RDF}, neither of these can be zero! For instance, for B_1, the definition says that S^{(k)}_{RDF}(B_1,G) = Sen(S_1), and Sen(S_1) must be greater than 0.
2) Moreover, as for the base case, the discussion of the induction step lacks an explicit explanation how things are related to the local sensitivity of COUNT(B) at distance k; i.e., why the formulas are guaranteed to give us values that are greater or equal to LS^{(k)}_{COUNT(B)}(G) in all possible cases. In other words, also here I am expecting references both to Def.2 and to the authors' definition of distance between two RDF graphs.
3) Besides these major issues, the discussion of the induction step uses a number of symbols that are not clearly introduced in the context of the proof (what is S in line 23? what is G_{S_1} and G_{S'} in line 26? Is B_i in line 29 meant to range over all BGPs in B' or is it just one of these BGPs?)

W.2.3 --- Problems in the proof of Theorem 3:

* The case of histogram queries is described only in a brief bullet point that is too short for me to be able make any sense of. Additionally, some parts of the given description cannot be understood. For instance, what is meant by "each changed triple"? (in fact, based on the authors' notion of distance between two RDF graphs, I would assume that this should be 'changed subgraphs' instead) How does a single changed triple (or subgraph) lead to exactly one solution mapping moving from one group to another, can't it be more than one solution mapping? What does the last sentence mean (it is not even a proper sentence in the English language)? Where is the value k in this discussion? How are these things related to the local sensitivity of the queries (Def.2)?

W.3 --- Minor issues:

* The Background section is missing bibliographic references: There must be references for Definitions 1-3, for Theorem 1, for the ideas and concepts in the last paragraph of page 3, and for Lemma 1.

* page 3, LHS, line 15: You should clarify that all the tuples in such a multiset are meant to have the same arity (same number of elements).

* page 3, LHS, line 17: What exactly is "the value" that you are referring to here?

* page 3, LHS, lines 16-25: The definition of the notions of "k-far apart" and "neighbours" seems to assume that the datasets are of the same size in terms of number of tuples. This assumption should be made explicit. Are these notions undefined for any pair D and D' with |D| != |D'|??

* same issue in the context of multi-table datasets (lines 42-51).

* page 3, LHS, line 45: What types of sets are T_1 ... T_n? Sets of tuples?

* page 3, LHS, line 45: What does "tagged" mean?

* page 3, LHS, line 45: What is meant by "data points" and what does "belonging to T_1,...,T_n" mean?

* page 5, LHS, line 6: It should be "query sensitivity" (instead of "query sensibility").

* page 5, RHS, lines 15-16: I don't see anything like that in Figures 3 and 4.

* page 6, LHS, lines 42-46: Instead of "sensitivity," I suggest to use a different name for this concept. Using the term "sensitivity" may be confusing because there already exists a different concept with this name in the context of differential privacy (as you have introduced before).

* page 6, RHS, lines 1-2: It should be "solution mapping" (instead of "solution assignment"). In general, you should be more consistent in terms of naming these things. Besides wrongly naming it "solution assignment" here, I have seen just "mapping" but also "result mapping" in other parts of the manuscript. Such inconsistency is confusing or even misleading.

* page 6, RHS, lines 29-39: I suggest to move this part to the of this section. At the current place it feels misplaced (in particular, because everything that comes below this part in the section is only about the things introduced before this part.

* page 7: There is a mistake in Example 3.3. The first triple pattern should be removed from B_2!

* page 7, LHS, line 39: "national convenience"???

* page 8, LHS, line 50: It should be "RDF triples" (instead of "RDF tuples").

* page 8, RHS, line 18: The word "sensibility" is wrong here!

* page 9: In the formula in Theorem 3, it should be S^{(k)}_{RDF}(Q,G) and not S^{(k)}(Q,G).

* page 9, RHS, line 11: There needs to be a bibliographic reference for Watdiv (in the same way as there is a reference for Wikidata a few lines before).

* page 9, RHS, line 17: What URI does this sentence refer to?

Review #2
By Simon Steyskal submitted on 03/Feb/2021
Suggestion:
Major Revision
Review Comment:

In the present paper, the authors introduce a differentially private SPARQL query engine for COUNT queries and support their proposed solution by providing an evaluation using Wikidata and Watdiv. At times the paper is rather hard to read, especially since more often than not particular choices of formalisms aren't motivated enough. Please find detailed remarks underneath ->

## General (throughout the entire paper)
---

1. s/[Dd]eferential(ly)?/[Dd]ifferential(ly)?/ -> deferential != differential (cf. https://www.merriam-webster.com/dictionary/deferential vs. https://www.merriam-webster.com/dictionary/differential)
2. a wild mix of upper/lower case of certain terms (e.g. [Dd]ifferential [Pp]rivacy, [Ss]tar queries, ...) be consistent!
3. s/[sS]tart/star/ -> p.5 lines 1,5 and p.14 line 25

## Abstract
---

1. "The notions of differential privacy are defined independent of the data model and the query language." -> the DM and the QL of what?
2. "Most results have been ..." -> what results?
3. "The data model has been typically the relational model and the query language SQL." -> again: what data model? "the" implies you are referring to a particular data model
4. s/deferential/differential/ -> deferential != differential (cf. https://www.merriam-webster.com/dictionary/deferential vs. https://www.merriam-webster.com/dictionary/differential)
5. "... good realizations of deferential[sic!] privacy for queries that required joins had been limited." -> what's "good"? why are joins suddenly relevant?
6. "most interesting queries accessing RDF graphs will require intensive use of joins" -> interesting according to whom? isn't the "interestingness" of smth. subjective?
7. "Recently though, new techniques" -> -though
8. "new techniques have been developed that can be applied to many types of joins in SQL with reasonable results" -> what "new techniques"? I reckon you are referring to DiffPriv-aware techn., but this isn't immediately clear from the sentence; reasonable as in ...?
9. "... if the RDF graph is accompanied with some natural semantic information about its structure" -> what's "some natural semantic information"? some additional triples? where does this info come from? how much is "some"?

## Introduction
---

1. "Privacy is associated with liberty, but it is also associated with privilege ..., with confidentiality, with nonconformity and dissent, with shame and embarrassment, with the deviant and the taboo (...), and with subterfuge and concealment" -> any particular reason you chose this rather negatively connoted definition of privacy?
2. "... Li et al [2] have looked at recent privacy breaches" -> I wouldn't call 2013 recent (that's ~8 years ago)
3. "and concluded, perhaps not surprising," -> -perhaps not surprising;
4. "that an electronic privacy breaches" -> -an
5. "always ended with giving an attacker the ability to identify using public data whether" -> identify (using public data) whether
6. "such the advance" -> such as the advance
7. "data is frequently made public" -> what kind of data? any data?
8. "But even in such cases," -> !Especially! in such cases..
9. "Publishing data with perfect privacy means that no assumption can be made about the prior knowledge an attacker may have about the supposedly anonymous set" -> what? so is this then a bad thing? i.e. in order for being able to assess an attackers prior knowledge about data requires the data to not be perfectly private? what?
10. s/prefect/perfect/
11. "t-closeness" -> add ref
12. "were developed but they were shown to have weak privacy guarantees." -> according to whom? add reference backing up your claim!
13. "In spite of its limitations" -> which are..? elaborate and/or add reference
14. s/deferentially/differentially/
15. "from the query applied to similar datasets, D'." -> from the query applied to similar datasets, D' \in\mathds{D}
16. "Smaller the diversity required, the better the utility. " -> The smaller the...
17. " This is called in the literature sensitivity of the query." -> what literature? add ref.!
18. " Calculating sensitivity is not trivial and approximations are used." -> says who? who uses approx. ? add ref!
19. "However, until recently, in order to get reasonable approximations of sensitivity ...[6]" -> 2009 != recently
20. s/deferential/differential/
21. "By the simple nature that data in RDF repositories is stored in binary relations, most interesting queries will require operations equivalent to joins" -> what? interesting according to whom? rephrase the whole sentence!
22. "This result has been possible by introducing" -> has been made possible
23. "of RDF tuples into sub-graphs that can be then used as single units for privacy protection." -> single units of what? what's a "unit"? rephrase!
24. "and it should not be difficult for an administrator to define" -> but you are not sure? what administrators?
25. "that uses the approximation to answer counting and grouping SPARQL queries" -> +"with joins" right?
26. "using the Wikidata knowledge base. We also run simulations using the Watdiv evaluation framework " -> add references for both wikidata and watdiv
27. "that can be answered with reasonable utility in smaller datasets" -> reasonable according to..? utility as in ..?
28. "with the fundamental concepts" -> to the fundamental concepts

## Background
---

1. "and how it has been addressed." -> by whom? you or in related work?
2. "Intuitively, a randomized algorithm" -> what's a randomized algo? why do we need them? elaborate!
3. "Formally, this corresponds to the notion of bounded differential privacy [8], where all datasets separated by a finite distance share the same number of tuples. " -> does it though? if I read [8] correctly, they say that "bounded -> |D| = |D'|", but not that "d(D,D') = k -> |D| = |D'|" right?
4. " In the reminder" -> remainder
5. "A dataset formed by multiple sets T1, ... , Tn" -> later you use T to denote dataset transformers.. as those two aren't the same thing, may consider choosing a different letter to avoid confusion
6. " In the realm of differential privacy, this "faithfulness" property is referred to as the mechanism utility. " -> this needs to be introduced earlier! (see 16/27 above)
7. "Definition 2" -> indicate that D' \in\mathds{D}
8. "On the one hand, it allows handling queries that fail to have a bounded global sensitivity, but do have a bounded local sensitivity" -> are there queries which have neither?

## SPARQL Preliminaries
---

1. "what we would like to protect is the disclosure of the presence or absence of a sub-graph of G that represents the contribution of a single individual or entity to G." -> what? what's a contribution to G? what's the difference between an individual and an entity?
2. "the contribution of entity i to the dataset G = Ug_i" -> add value range of i
3. "we define graphs at distance 1 as graphs" -> at distance 1 to what? each other?
4. s/don't/do not/
5. "In a simple start we call the center of the star" -> s/start/star/
6. "Note that start BGPs are similar to Star queries" -> s/start/star/
7. "triple patter in a star" -> s/patter/pattern/
8. "Figures 3 and 4 show two disjoint star BGPs with central vertex ?v0." -> are you talking about the figures on page 10? or is this a copy/paste error?
9. "Hence, given an RDF graph G, we will call the (RDF) graph induced ... induced by each of the star BGPs in the schema." -> please shorten this sentence and/or split it and/or try to rephrase so it gets easier to read and understand..
10. "iff G coincides with the graph induced by P from G." -> when are two graphs coinciding with each other?
11. "Administrators should then provide Differential Privacy..." -> what Administrators?
12. "with the largest stars possible.." -> the largest stars of what? the largest amongst what?
13. " A graph g belongs to the set iff ..." -> what set? move the following definition here: "and denote the set of all sub-graphs, {g1, ... , gn}, by P(G)"
14. "and its meaning, denote by [[B¯]]_G" -> denoted
15. "Example 3.2" -> what's the purpose of this example? context? you don't refer to it in the text..
16. "For any query B¯ ... we can naturally define ... a split B÷P of B as follows" -> what's a split? why do we need such a split? elaborate!
17. "The interest of B÷P is that it lets us isolate the part of the graph necessary to answer each Bi" -> what; rephrase

## Evaluation
---

1. the formatting of lines 18-26 on page 11 is completely off..
2. "S^(k)(,)" -> (,) ?
3. fix formatting of Table 1/2's Elastic Stability column

## References
---

1. [10-11] -> duplicates
2. [13] -> use latest Recommendation version not Last Call WD

Review #3
Anonymous submitted on 09/Feb/2021
Suggestion:
Major Revision
Review Comment:

The paper focuses on how to extend the notion of Differential Privacy (DF) applied to relational joins (Johnson et al) to basic graph patterns with filters.  

It builds upon the concept of a Differential Privacy schema, which consists of triples or conjunctions of triples that should be considered by the DF mechanism.

The results show that only star-shaped queries compared to linear or snowflake queries can achieve better privacy whereas also the size of the dataset makes a difference..
The latter results are not surprising as this was also shown by previous work in the relational setting. The experimental insights are thus somehow to be enhanced. 

There are revision items that can be addressed to make the paper more substantial as follows:

- the part on schema compliance is a bit less formal compared to other parts; for instance, the embedding of the schema constructs into the instance are computed using homomorphisms but this is never made explicit.

- it is unclear why the authors want to present negative results on a synthetic graph generator with small datasets and with the caveat that WatDiv query generation is not suitable. They might want to enrich these experiments with other generators (for instance gMark https://github.com/gbagan/gmark. A more thorough study can be done.

- I do not think the authors make justice to the discussion of related work, especially reference [23]. The latter is a declarative framework for specifying privacy and utility queries and dealing with the application of anonymization operations on the graph. As such, blank nodes are just one of these anonymization operations.

- the paper has many English flaws and typos. The authors should check them. e.g. page 5 "start BGPs", "triple patter"