Privacy in Ontology-based Information Systems: A Pending Matter
Review 1 by Michel Dumontier:
This well-written article addresses the issue of privacy in ontology-based systems. After a brief introduction illustrating use cases in the medical domain, it identifies two key challenges: 1) to develop privacy theory in ontology-based information systems and 2) to understand how access restrictions can guarantee privacy.
Although systems that use access policies are widespread, it was very interesting to see that policy violations may arise not only from a single query against restricted knowledge, but also from a sequence of seemingly innocuous queries whose answers, taken together, violate the policy. Formalizing (and keeping track of) this background knowledge in such a way that policy violations can be detected is certainly an interesting research problem.
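The composition effect described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the predicate names (`treatedBy`, `specialistIn`, `hasConditionIn`), the single inference rule standing in for the user's background knowledge, and the secret fact. A policy check is modeled as testing whether the secret is in the deductive closure of the answers released so far.

```python
# Hypothetical illustration: facts are (predicate, subject, object) triples.

def treated_by_specialist(known):
    """Assumed background rule: a patient treated by a specialist in some
    field is inferred to have a condition in that field."""
    return {
        ("hasConditionIn", patient, field)
        for (p1, patient, doctor) in known if p1 == "treatedBy"
        for (p2, doc2, field) in known if p2 == "specialistIn" and doc2 == doctor
    }

def closure(facts, rules):
    """Forward-chain the rules over the facts until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known

# The fact the (hypothetical) policy wants to keep hidden.
SECRET = ("hasConditionIn", "john", "oncology")

def violates(answers, rules, secret=SECRET):
    """True if the secret follows from the released answers plus the rules."""
    return secret in closure(answers, rules)

# Answers to two queries that each look harmless in isolation.
answer_1 = {("treatedBy", "john", "dr_andrew")}
answer_2 = {("specialistIn", "dr_andrew", "oncology")}
rules = [treated_by_specialist]

leak_1 = violates(answer_1, rules)
leak_2 = violates(answer_2, rules)
leak_both = violates(answer_1 | answer_2, rules)
```

In this sketch neither answer alone entails the secret, but their union does, which is why a system would need to track the history of released answers rather than check each query independently.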
My only critique of the paper is that it states that "... they have also left many open problems and further research is needed before they can be incorporated in practical systems." with the justification of "Due to lack of space". I really believe that this needs to be clearly articulated, if only briefly. Early components (1,2) of the paper can be reduced or the technical details in 4 omitted to provide the space required.
Review 2 by Paulo Pinheiro da Silva:
The short paper is very interesting although the content needs to be presented in a more rigorous way and probably further reviewed by experts in privacy. The motivation lacks a more extensive review of privacy literature (as opposed to semantic web literature). The proposed solution for ontology-based privacy is based on assumptions that may not be easily accepted by privacy researchers. These issues are elaborated below.
In the background section, the author mentions that users can indirectly retrieve information via logical inference. This notion of indirect retrieval of information needs to be clearly defined, since there are many possible ways of combining information retrieval and information derivation, and many of these combinations are not discussed in this paper. Later in the Background section the author starts to elaborate on this with the example of John. However, the example makes this notion of indirect information retrieval even more obscure.
The term 'certain answers' needs to be defined because it appears that the author assumes answers to be certain even if they are based on uncertain data (e.g., most clinical data).
In the general challenges, the author uses the term ontology as a synonym for knowledge base. This is fine for me but certainly not okay for other readers. In a more restrictive interpretation, ontologies would not contain most of the statements supporting facts such as that Bob, John and Dr. Andrew are of type person. With that in mind, phrases such as 'data of the ontology' would not make sense.
A more relevant comment for this section is the fact that the author is exposing privacy issues related to knowledge derived by deduction. This is the natural first step for a privacy study in the semantic web context. However, in the privacy community, the real issue is the exposure of knowledge derived by induction and the literature in this field is extensive. This is exactly the point where this paper lacks a stronger connection with state-of-the-art work on privacy.
To further elaborate the case above, the need for issues (2) and (3) in the section on the proposed design of a privacy-preserving system requires a better justification. We understand that (1) is an important issue, but once a privacy policy is in place, users should be able to query whatever they are capable of accessing (issue 2). Moreover, the system cannot control how query answers will be used, because users can always interact with other users (issue 3). An interesting case to consider is the Netflix fiasco, in which an anonymized test data set was later de-anonymized, creating a very uncomfortable situation for the company (http://www.securityfocus.com/news/11497). The paper needs to consider anonymous users, including robots and other information-scraping agents. As the author can see, the proposed solution is still very incipient and probably not ready for publication. A more comprehensive motivation would probably be more appealing.
Other issues:
- (throughout the paper) Extensive unnecessary use of the 'the' article. For example, in the abstract, we could have "preventing unauthorized access to data and knowledge in ontologies" instead of "(…) to the data and the knowledge in the ontology";
- (section 1) "OWL ontologies can be used to process data" - I am not sure ontologies can process data;
- (section 1) "could be disastrous" – how?
- (section 2) "whose security policy" – replace by "whose privacy policy"?
- (section 2) "secret information" – replace by "restricted information"?
- (section 3) "the design a privacy preserving" – add 'of'
- (section 3) Consider related work on partial import, both in the OWL spec and in linked data
Comments
Food for further thought
The paper makes a very strong case about privacy protection and reasoning. In particular, it tries to prevent a user from deducing, from the answers given to queries, information that the system does not want to disclose. This is indeed a very important problem, which arises precisely because we are in a semantic environment.
Paulo Pinheiro da Silva mentioned that this is only the first step (but a necessary step). I wanted to comment as well on this point and offer a further complementary challenge which brings us one step further on the reasoning scale.
Indeed, the paper considers "Formalization of users' prior knowledge", but it should also take into account the knowledge that the user may have about the system itself, which the user may exploit to deduce prohibited answers. It may be that the user can infer an answer from the mere fact that the system refuses to answer some query.
For instance, imagine a laboratory that can only provide two types of analyses: one that would directly indicate a pathology that I am not supposed to know about if diagnosed, and another that is far more open in that it can concern various diseases (so in terms of information theory I would learn only one bit, the test). If the system refuses to answer my request about which analysis has been performed, then I can legitimately deduce that the first test is involved, and thus what the diagnosis is. I cannot deduce this if I do not ask or if I do not know the principles on which the system is based.
This is only a simple example; it is easy to design more complex ones.
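The laboratory example can be sketched in a few lines. This is only an illustration under assumed names: the two analysis labels, the refusal behavior (returning `None`), and the `knows_policy` flag are hypothetical and do not come from the paper.

```python
# Hypothetical laboratory offering exactly two analyses.
ANALYSES = {"specific_test", "general_panel"}
# Assumed policy: the system must not reveal that the specific test was run.
RESTRICTED = {"specific_test"}

def naive_answer(performed):
    """Answer 'which analysis was performed?', refusing (None) when the
    answer itself is restricted."""
    return None if performed in RESTRICTED else performed

def user_inference(response, knows_policy=True):
    """Set of analyses the user considers possible after seeing the response."""
    if response is not None:
        return {response}
    # A refusal only narrows the candidates if the user knows how the
    # system decides when to refuse.
    return set(RESTRICTED) if knows_policy else set(ANALYSES)

inferred_informed = user_inference(naive_answer("specific_test"))
inferred_uninformed = user_inference(naive_answer("specific_test"),
                                     knows_policy=False)
```

With knowledge of the policy, the refusal leaves a single candidate, so the refusal discloses exactly the bit it was meant to hide; without that knowledge, both analyses remain possible.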