Differential Privacy and SPARQL

Tracking #: 3166-4380

Carlos Buil Aranda
Jorge Lobo
Federico Olmedo

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
Differential Privacy is a framework that provides formal tools to develop algorithms to access databases and answer numerical and statistical queries with quantifiable accuracy and privacy guarantees. The notions of Differential Privacy are defined independently of the data model and the query language. Most Differential Privacy results have been obtained on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. The data model used by the framework research has been typically the relational model and the query language SQL. However, effective realizations of Differential Privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying Differential Privacy in RDF knowledge graphs and SPARQL. By the simple nature of RDF data, most useful queries accessing RDF graphs will require intensive use of joins. Recently, new Differential Privacy techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new definitions can be transferred to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer count queries over a large class of SPARQL queries that guarantees Differential Privacy, if the RDF graph is accompanied with semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large graph databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of Differential Privacy for SPARQL and RDF.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 18/Aug/2022
Review Comment:

The authors have addressed my comments about revising the related work and recasting the set of queries on which the approach is evaluated.

A minor suggestion for the final version is to improve the readability of the publicly available git repository and add more instructions about reproducibility.

Review #2
By Simon Steyskal submitted on 23/Oct/2022
Review Comment:

The paper has been significantly revised and together with the authors' response accompanying the resubmission most if not all of my raised remarks/questions were addressed. Thank you very much!

There are only a few minor points that I would like to see addressed:

1) emploies => employs (throughout the entire paper)
2) "Formally, this corresponds to the notion of bounded differential privacy [8], where all datasets separated by a finite distance share the same number of tuples. " -> does it though? if I read [8] correctly, they say that "bounded -> |D| = |D'|", but not that "d(D,D') = k -> |D| = |D'|" right?
* => what I meant was that [8] (now [15]) states that "Bounded differential privacy derives its name from the fact that all of the datasets D1, D2 have a fixed size n" which I read as "bounded because |D1|=|D2|=n" but not as "d(D1,D2) = n implies that |D1|=|D2|". If I understood that incorrectly, please ignore this point.
3) "The only case that this is not the case is for queries Q1a and q1b" -> Q1b
4) "Nissim et al. show how to exploit this idea of instance-based noise [17, 18]" -> isn't [18] the extended version of [17]? why cite both?
5) Formatting of Appendices -> only a minor remark, but I would appreciate a "cleaner" formatting of both table 3 (e.g. overlapping cell values esp. "Snowflake..", text margin is too small) and queries (sometimes there's a single "." in a row; also linewrap leads to weird indentations)

all in all the quality of the paper has increased significantly and I appreciate the authors' efforts for doing so!

Review #3
By Olaf Hartig submitted on 27/Oct/2022
Minor Revision
Review Comment:

The initial version of this manuscript, which I reviewed earlier, had significant flaws both in the technical part and in the evaluation. Given the revised version now, I am happy to see that these major flaws have been fixed. However, the revised version still has the following three weaknesses:

W1 (lack of a motivation of the presented work): While there is a general motivation for Differential Privacy, the authors do not motivate why we should consider and study Differential Privacy for RDF and SPARQL.

W2 (formal inaccuracies): There are still a number of formal inaccuracies in the approach (details below). These need to be addressed before the manuscript can be accepted.

W3 (confusions about the evaluation): Some aspects of the evaluation are confusing and not clear to me (details below).

Addressing these three weaknesses would require another minor revision of the manuscript.

The remainder of this review elaborates more on weaknesses W2 and W3. Afterwards, I additionally list several minor issues that should also be addressed.

Remaining formal inaccuracies in the approach:

1/ The notion of "induced sub-graph" must be defined formally because the rest of the formalization (and, thus, the approach as a whole) builds on this notion. Without a formal definition it is not clear whether a star S may induce multiple subgraphs for a single individual or whether/how this notion guarantees that the subgraphs are disjoint (which would be required for Def.5). In fact, it is not even clear what exactly the "set of sub-graphs" is and what each sub-graph in this set is related to. The reader can figure the answers to these questions by studying the examples, but to avoid any misunderstanding or ambiguity it is important to be explicit about these questions and define formally what the induced sub-graphs of a star are.

2/ The condition in Def.5 should be more restrictive. That is, the disjointness should not be just between with stars with their triple patterns but only between the sets of predicate URIs that the triple patterns in the stars have. In other words, by the current definition, the set {S1,S2} with following two stars S1 and S2 is a dp-schema but it should not be one:
S1={ (?s1, livesIn, ?o1) }
S2={ (?s2, livesIn, ?o2) }

3/ Regarding the test of compliance verification as outlined on page 7 (lines 14-20 on the RHS), I think it is not that easy. Just "verifying that all predicates in the graph appear also in (some star of) the schema" is not enough. The join variable within each star needs to be taken into account as well! For instance, consider a dp-schema with a single star as follows:
S = { (?s, livesIn, ?o1), (?s, phone, ?o2) }.
The RDF graph G={ (Alice,livesIn,Seattle), (Bob,phone,"+1415555") } does not comply with this dp-schema but the simply predicate-based test would wrongly indicate that it does.

4/ The formula at the end of Def.8 is wrong. The product should range over tp ∈ S (not over S ∈ P), and the factors of the product should then be κ(tp), rather than κ(S).

5/ The definitions of the types of user queries in Sec.4 (page 9, lines 11-33 on the LHS) are inaccurate. First of all, it is not clear what "the center position of a triple pattern" is. Assuming that this is sloppily phrased, the actual issue is that there are three conditions for the triple pattern t but this fact is not made obvious to the reader. The first condition is that t must be in \bar{B}, the second condition is that t must be in one of the stars of the dp-schema, and the third condition is that t must contain the variable ?x. Additionally, there is the condition that ?x must be the center of the star that contains t. These conditions need to be separated and formalized more clearly. Some of the notions relevant to this end are introduced later (page 9, lines 13-27 on the RHS). Hence, these notions should be introduced earlier and then used in the definition of the types of user queries. However, even after doing that, there remains the issue that the variables in the user queries may be different from the variables that are used in the triple patterns in the stars of the dp-schema. Hence, the definition of the types of user queries also needs to include canonicalization of variable names. For instance, the variable ?x in Example 4.1 is not in any of the corresponding stars in Example 3.1.

6/ The definition of the split of a BGP (page 9, RHS, lines 28-36) has two issues.
First, it has a similar issue as mentioned in point 3 above. Having the condition pred(Bi) ⊆ pred(S), which considers only the predicate URIs, is insufficient. The respective positions of the join variable in all triple patterns of Bi need to be considered as well.
Second, once the previous issue is fixed, there remains another issue similar to the one mentioned at the and of the point 5 above. That is, the variables in the given user query may be completely different from the variables in the stars of the dp-schema. Hence, some canonicalization of variable names is needed here too.

7/ There is an incorrect claim in Sec.5.2 on page 11, lines 35-36 on the LHS. The claim is that "these frequencies can be pre-computed as they are independent of any query." This claim is incorrect because, by definition, these frequencies depend on the BGP of the given user queries. An indication for that is that such a frequency is denoted by mpv(?x,B,G), see line 21, where B is the BGP of the given user query. Also, the suggested approach to determine the frequency, as described in lines 23-30, is based on a query that uses the BGP of the user query in its WHERE clause.

Confusions about the evaluation:

8/ I don't exactly understand the method used for creating and, in particular, for naming the test queries. What does the 'b' and the 'c' mean? What do, say, Q3a-Q3c have in common (given that their names begin with the same prefix)? What do, say, Q3c and Q4c have in common (given that their names end with the same letter)?

9/ In several places, the evaluation talks about the "result size" of the test queries (for instance, in line 40 on the RHS of page 15) and there is even a chart that plots measurements related to "Result Size" (Figure 5). Given that the test queries are all COUNT queries, I wonder what is meant by result size of these queries and how the result sizes can be different. Shouldn't the result size of a COUNT query be 1 (unless there is a GROUP BY in the queries, which is not the case for the test queries used in the evaluation)?

10/ Table 2 raises two questions for me: First, it is not clear what the value of x in Table 2 is. Second, the table caption mentions "Columns Percentage Error 1.0 and 0.1" but there are no such columns in the actual table.

11/ page 18, RHS, lines 5-25: The paragraph here makes several references to Watdiv-based experiments. Where is the description of these experiments?

Minor issues to be addressed:


* What is "the framework research"?

* Regarding the following two sentences: "However, effective realizations of Differential Privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying Differential Privacy in RDF knowledge graphs and SPARQL." -- This part needs to be rephrased because it is not clear to the reader why limitations for SQL queries would impose something for RDF and SPARQL (What do SQL queries have to do with RDF and SPARQL??).

* What are "count queries over a large class of SPARQL queries"? The word "over" is misleading/confusing here. It sounds as if you are talking about queries that are meant to count SPARQL queries (from that "large class").


* page 2, LHS, line 8: It is not clear what is "such data" that you are referring to here.

* page 2, RHS, line 5: It is not clear what you mean by "good" here. What makes a "realization[] of Differential Privacy for queries that required joins" a "good" one?

* page 2, RHS, lines 4-9: Same issue as in the abstract. What do SQL queries have to do with RDF and SPARQL?

* page 2, RHS, line 15: It is not clear what you mean by "reasonable" here. What are "reasonable results"? ...and what are "very reasonable results"?

* page 2, RHS, lines 15-17: Why? What is the motivation of this question? To some extend the issue here is, again, that it is not clear what SQL queries have to do with RDF and SPARQL. Additionally, however, there is the bigger issue that you never motivate why you aim to study Differential Privacy for RDF and SPARQL.


* page 3, LHS, lines 17-18: What is "the value of [... a] tuple[]"? Does each tuple have (only) one value?

* Theorem 1: Indicate clearly where this result is from. Write: "Theorem 1 [16]. Given ..."

* Same for Lemma 1!

* page 3, RHS, line 34: "global sensibility" -> "global sensitivity"


* page 7, LHS, line 37: This subsection should have a number (3.2).


* page 8, RHS, line 45: You need to make clear what you mean by "semantically valid."


* page 11, LHS, lines 26-27: Given that the aim is to find the "most popular values", the variable ?x needs to be projected by the SELECT clause of this query. Otherwise, you would only get counts without knowing which values they are for.


* page 15, LHS: The paragraph here mentions the word "start" in several places. I assume that is a typo and should be "star" instead.

* page 15, LHS, lines 18-19: It is not clear what notion of coverage you mean when you talk about " queries covered by a single start from the dp-schema."

* page 15, LHS, lines 19-20: How can "queries [...] add [something such as] filters and remove [something such as] triple patterns"? ...and where would queries add and remove such things??