Differential Privacy and SPARQL

Tracking #: 3421-4635

Carlos Buil Aranda
Jorge Lobo
Federico Olmedo

Responsible editor: 
Guest Editors ST 4 Data and Algorithmic Governance 2020

Submission type: 
Full Paper
Differential Privacy is a framework that provides formal tools to develop algorithms to access databases and answer statistical queries with quantifiable accuracy and privacy guarantees. The notions of Differential Privacy are defined independently of the data model and the query language at steak. Most Differential Privacy results have been obtained on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. So far, the data model used by the framework research has typically been the relational model and the query language SQL. However, effective realizations of Differential Privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying Differential Privacy in RDF knowledge graphs and SPARQL query language. By the simple nature of RDF data, most useful queries accessing RDF graphs will require intensive use of joins. Recently, new Differential Privacy techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new results carry over to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees Differential Privacy, if the RDF graph is accompanied with semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large graph databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of Differential Privacy for SPARQL and RDF.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Olaf Hartig submitted on 21/Apr/2023
Minor Revision
Review Comment:

I confirm that most of the weaknesses mentioned in my review of the previous version of this manuscript has been addressed now. However, there are still a few issues for which I would like to see another minor revision (after which, I am sure, the manuscript will be ready to be accepted).

1/ There are issues with the definition of the notion of "induced sub-graph" that was added. The notion is defined for a given pair consisting of a star BGP S and a solution mapping μ over an RDF graph G. The first issue is that the notion of "a solution mapping over an RDF graph" is undefined, and also not a notion commonly considered in the SPARQL literature (instead, the notion of a solution mapping as commonly defined is independent of any particular RDF graph). The second issue is that, by defining "induced sub-graph" based on such a pair, it should be necessary to mention both the corresponding star BGP and the solution mapping when referring to an induced sub-graph. However, the paper refers to induced sub-graphs by mentioning a star BGP and an RDF graph instead; by doing that, it is not clear anymore, which solution mapping would be used to construct the corresponding subgraph and, thus, it is not clear which subgraph exactly it is. This problem becomes evident in Example 3.2 where the subgraph g2 contains two phone triples. There is no way to produce this subgraph from the star BGP S1 and a single solution mapping! Subgraph g4 in the example has the same problem. Another smaller issue is that the authors start in the example to use some notation for induced sub-graphs, namely S1(|G|), which has not been introduced.

2/ The definition of the center of a triple pattern (page 9, LHS, lines 31-37) still has the following issue. The definition of this notion assumes a given dp-schema but, later in the paper, the notion is used as if it is independent of any dp-schema. In other words, for a given triple pattern tp, it is not sufficient to write center(tp) but, instead, it must be center(tp,P) where P is a dp-schema. This change to the definition must then be reflected in the rest of the paper.

3/ Point 7 from my previous review has not been addressed. I am repeating this point here (but with the updated line numbers): There is an incorrect claim in Sec.5.2 on page 11, lines 3-4 on the RHS. The claim is that "these frequencies can be pre-computed as they are independent of any query." This claim is incorrect because, by definition, these frequencies depend on the BGP of the given user queries. An indication for that is that such a frequency is denoted by mpv(?x,B,G), see line 21, where B is the BGP of the given user query. Also, the suggested approach to determine the frequency, as described in lines 23-30, is based on a query that uses the BGP of the user query in its WHERE clause.
--> I wonder why this point has not been addressed. Perhaps there is a misunderstanding about what exactly are the frequencies that the claim refers to? If so, then this part of the claim should be made more clear. Otherwise, I would expect that the claim is elaborated on further because I still don't see how this claim can be correct.

4/ Many small typos in the parts of the manuscript that have been changed or added for this revision leave the impression that this revision was done in a haste.

* "ralational model" on line 12, RHS, page 2

* "Johnson et al. approach’" -> "Johnson et al.'s approach" on line 19, RHS, page 2

* "start" -> "star" on line 38, RHS, page 6

* "triple patter" on line 40, RHS, page 6