Privacy Aware and Faceted User-Profile Management Using Social Data

Owen Sacco, Fabrizio Orlandi, and Alexandre Passant
In the past few years, the growing amount of personal information shared on the Web (through Web 2.0 applications) has increased awareness regarding privacy and personal data. Recent studies showed that privacy in Social Networks is a major concern when user profiles are publicly shared, revealing that most users are aware of privacy settings. Most Social Networks provide privacy settings, such as Facebook’s privacy preferences, that restrict access to private data to those who are in the user’s friends lists (i.e. their “social graph”). Yet, the studies show that users require more complex privacy settings, as current systems do not meet their requirements. Hence, we propose a platform-independent system that allows end-users to set fine-grained privacy preferences for the creation of privacy-aware faceted user profiles on the Social Web.
Submission type: 
Full Paper
Responsible editor: 
Guest Editors
Major Revision

Solicited review by Bettina Berendt:

The paper proposes a scheme for combining data from different social networks and then revealing it selectively (the authors call this "privacy aware").

The ideas are nice, the scheme appears to be a well-worked-out piece of engineering, with a good overview of existing work, but the notion of privacy is viewed too naively and based on a too-small and biased selection of literature. The idea of how to disclose data selectively does not really add anything new to existing attribute-based access control policies (although the combination with FOAF/Semantic Web appears to be new). The user study is methodologically weak. These two factors gave rise to my "neutral" scores for methodology and literature review.

One key idea is to reveal data based not on person/identity, but based on properties. This is a nice idea, consistent with Semantic Web ideas, and also consistent with new developments in social networks and privacy research. However, it is based on an assumption that runs counter to everything that we have learned in discussions of privacy during the past decades: that people are honest. It's very nice to say that "everyone from my workplace should have access to my telephone number, and no-one else should" - but this is extremely dangerous in an environment that cannot ensure that people tell the truth about their workplace!
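
The concern can be made concrete with a minimal sketch. Everything in it is hypothetical and not taken from the paper: the `Requester` type, the `verified` flag, and both check functions are illustrative assumptions. It contrasts a rule that trusts a self-asserted workplace with one that additionally requires the attribute to have been verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    """Hypothetical requester profile; not a structure from the paper."""
    name: str
    workplace: str   # self-asserted profile attribute
    verified: bool   # assumption: a trusted party confirmed the workplace


def naive_grant(requester: Requester, required_workplace: str) -> bool:
    # Grants access on the claimed attribute alone.
    return requester.workplace == required_workplace


def verified_grant(requester: Requester, required_workplace: str) -> bool:
    # Grants access only if the claimed attribute was also verified.
    return requester.verified and requester.workplace == required_workplace


colleague = Requester("alice", "ExampleCorp", verified=True)
impostor = Requester("mallory", "ExampleCorp", verified=False)
```

Under the naive rule both requesters would obtain the telephone number; only the second rule rejects the unverified claim. How such verification could be obtained in an open Social Web setting is exactly the open question.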

Another point: If you write from the perspective of privacy, then the unchecked assumption that people will *want* to merge their different identities is at best naive. From a privacy perspective, separated identities can be extremely desirable. This should be discussed.

Section 2.1.4 has very long paragraphs and an unclear argumentative structure. Please revise.

On p.7, you say that you selected "some of the most interesting systems from our perspective". This is underspecified and needs to be explained better.
The concept of "facet" remains somewhat fuzzy throughout the paper. This is quite confusing for a reader who thinks of it as a well-defined term (faceted search). Please make sure this "overloading" is taken into account and the term well-defined.

p.8: It is true that P3P itself does not allow users to specify anything; but it does have complementing technology to do this, e.g. APPEL. Having said that: it is being argued that "P3P is dead". This article would be a good place to reflect this.

On p.13 ff., it seems that your idea that data should be revealed based on profile information is getting the better of you. First, why concentrate solely on this - why not also give the option to decide based on identity? Second, profile information can be fake, see above. Third, as Fig. 17 shows, only an - even if big - minority of users you asked said they wanted this. Your interpretation of this (or rather, of the 56% who did NOT want it) appears quite biased.

Sections 4.1.2 and 4.1.4 are hard to understand. Pls consider giving examples. The clause "of which property describes the literal" appears incorrect - do you mean "which property is described by the literal"?

p. 16: "Algorithms such as [30] can be applied to identify whether or not the person has the authority to create the privacy preferences on a particular dataset" - this needs a bit more explanation.

Pls give more information on the users that you surveyed. (E.g., how were they recruited / what was the incentive to participate, demographics, social-network sophistication, ...)

The study itself is not described in much detail, but it seems to be designed in a somewhat too straightforward way (e.g. no control questions).

Analogous remarks apply to the user study described in Section 7.

p.20: "Our aim is to study whether users are satisfied with our approach ..." -> to find this out, you would also need a baseline.

After finishing your article, I was wondering: to whom do the users disclose something? the web? another social network? an API? Pls clarify.

Minor points:

* all section references are broken. Pls fix.

* p.2: deduce -> induce

* p.7: fine-grained access control statements -> not yet defined. Give at least an intuitive understanding.

* p.8: "this requires the user to define the same action each time with different objects" -> unclear and hard to parse. Pls rephrase and supply an example.

* p.14: Fig. 5 is unclear, please briefly explain the namespaces close to the figure so readers do not feel they are expected to know them.

* p.19: considerate -> considerable

* the last para before Section 7 is unclear, pls revise.

Solicited review by David Vallet:

In this paper, the authors present a framework and vocabulary to support fine-grained privacy controls on Social Web data. The framework is well presented and motivated.
The paper in its present form is quite complete. However, if the paper is accepted, I would consider it essential for the final version to include a higher number of users in the evaluation of Section 7.
The benefits of taking a semantic approach to represent the privacy constraints should be highlighted more clearly by the authors. For instance, it was not clear to me whether the semantic representation of constraints allows the use of standard reasoning engines, or whether everything was done through ad-hoc procedures.
The evaluation section is very comprehensive (maybe too much so). I would recommend restructuring the subsections along the lines of what is shown in Table 7. Sections such as 2.1.2 are really hard to follow, for instance. Maybe some content can be cut from the sections that cover general domains.
On a further note, I would like to see some initial statements on how your framework may be integrated into current social services. We know that initiatives such as Google's Social Graph API are a step forward in the integration of user profiles, but I wonder how you envision that your privacy framework may interact with these or other types of initiatives.
As users in the evaluation stated, setting privacy settings on individual preferences does not seem to be the way to go, as it may be too cumbersome. As suggested, a clustering approach may fit better -- perhaps DBpedia could be exploited to group concepts. The representation of preferences as single weighted concepts could also be improved by allowing the representation of complex preferences that may not have a category in DBpedia, such as "music of the 80's", "movies directed by Steven Spielberg" and the like.

Minor comments:
- Please fix the multiple broken "Section ??" references.
- It seems more natural for the global interest formula presented in Eq. 1 to include a normalization factor that takes into account the scale of each source.
- Eq. 2 could artificially favor concepts that commonly appear in all three of the studied sources. Some concepts are more prone to appear in one source than in others (e.g. location may appear more often in Facebook).
- In Section 5, I do not agree with the statement "This illustrates that users are unhappy with the current implementation of privacy settings", as only 2 of the 70 users stated that they did not trust privacy on current social web applications.
- I wonder whether your approach allows the creation of groups, similar to Google+; it was not clear to me.
- Check the sentence in sec. 7: "The users did not have any problems in getting used to the system which in fact it lasted the user between" => The "which" seems a bit off here.
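
The normalization point on Eq. 1 can be illustrated with a short sketch. The paper's Eq. 1 is not reproduced in this review, so the form used here (a sum over sources of a source weight times a per-source interest weight) is an assumption, as are the function name, the example sources, and the choice of dividing by each source's maximum weight.

```python
from typing import Dict

# Hypothetical per-source interest weights, deliberately on different scales.
SourceWeights = Dict[str, Dict[str, float]]


def global_interest(interests: SourceWeights, source_weight: Dict[str, float],
                    normalize: bool) -> Dict[str, float]:
    """Combine per-source interest weights into one score per concept.

    Assumed form: score(c) = sum over sources s of w_s * w_i(s, c).
    With normalize=True, each w_i is first divided by the largest weight
    observed in its source, so every source contributes values in [0, 1].
    """
    scores: Dict[str, float] = {}
    for source, concepts in interests.items():
        scale = max(concepts.values()) if normalize and concepts else 1.0
        if scale == 0:
            scale = 1.0  # avoid division by zero for an all-zero source
        for concept, w_i in concepts.items():
            contribution = source_weight[source] * w_i / scale
            scores[concept] = scores.get(concept, 0.0) + contribution
    return scores


interests = {
    "source_a": {"jazz": 40.0, "rdf": 10.0},  # raw counts
    "source_b": {"rdf": 0.9, "jazz": 0.1},    # already in [0, 1]
}
weights = {"source_a": 0.5, "source_b": 0.5}

raw = global_interest(interests, weights, normalize=False)
scaled = global_interest(interests, weights, normalize=True)
```

In this toy data the count-based source dominates the unnormalized ranking, while the normalized scores rank the two concepts in the opposite order, which is the kind of scale effect the comment is concerned with.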

Solicited review by Till Plumbaum:

The authors present a system that allows users to aggregate their user profiles from different social networks and set fine-grained privacy preferences for these aggregated profiles. The paper covers two important questions: how to aggregate distributed user profiles, and how to enable users to set and manage privacy preferences. Two motivations for this research are presented. First, the authors state that today's privacy mechanisms in existing social networks do not fit the needs of the users. Second, such a profile aggregation could help to overcome the cold-start problem, as more information collected from different applications helps to get a better picture of the user. I agree with both the motivation and the resulting research questions. However, the mentioned problems are not new. A lot of work in both areas, profile aggregation and privacy management, has already been done and is well presented by the authors in the related work chapter. This shows that the authors are well aware of existing approaches; unfortunately, no discussion of the existing approaches and no highlighting of the novelty of the presented system is provided. Thus, the reader is left in the dark about the contribution of this work. The following description of the system is subdivided into two parts, the user profiling and the privacy mechanism.

The user-profiling process consists of different steps. First, the user data is collected from different SNS. This is done by handwritten scripts that collect data directly from the web page or through existing authentication APIs. The collected data is then transformed into a semantic representation (using existing ontologies like FOAF, SIOC, DOAC, WI, WO etc.). Collected user interest information is weighted during the aggregation process, so that in the end one large user profile exists that includes all user information.

I have several questions about this approach. First, the overall description of the user profiling process is very vague and lacks important information needed to judge the validity of the general idea. In the related work, the authors identify some important points in the aggregation process, like user identification, the aggregation of different user model representations, or the heterogeneity of user information. None of these points is really addressed in the paper. My main questions are:
- The data collection process is described only briefly.
- What are the challenges in collecting information from the SNS, e.g. Twitter, and publishing them as RDF triples?
- How is the transformation from the application dependent user profiles to the FOAF profiles done? Manually, using existing ontology matching approaches, etc.?
- The whole interest computation process is described only sketchily. The presented formula consists of just two variables: ws and wi.
- How are ws and wi computed? For wi, the authors only claim that "many different factors can be considered to influence the weights of the interests". And for ws? A little more explanation would be beneficial.
- During the aggregation, a duplicate filtering is done "and this is done automatically by a triplestore during the insertion of the statements". I assume this means that duplicate triples are just overwritten.
- How are near duplicates or synonyms handled?
- What about contrasting interests?
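
The duplicate question can be illustrated with a small sketch. It models a triplestore as a plain Python set of (subject, predicate, object) tuples, which is a simplification and not the store used in the paper; the user URI and the tag URI are hypothetical. Exact duplicates collapse on insertion, whereas near duplicates and synonyms survive as distinct statements.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def insert(store: Set[Triple], triple: Triple) -> None:
    # An RDF graph is a set of triples, so re-adding an identical triple is a no-op.
    store.add(triple)


store: Set[Triple] = set()
user = "http://example.org/user#me"  # hypothetical URI
interest = "http://xmlns.com/foaf/0.1/topic_interest"

insert(store, (user, interest, "http://dbpedia.org/resource/Semantic_Web"))
insert(store, (user, interest, "http://dbpedia.org/resource/Semantic_Web"))  # exact duplicate
insert(store, (user, interest, "http://example.org/tag/semanticweb"))        # near duplicate
```

After the three insertions the store holds two statements: the exact duplicate is absorbed, but the two URIs that denote the same interest are kept side by side. Reconciling them would need an explicit matching step, which is what the questions above ask about.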
Following the user-profiling component description, the privacy preference ontology is explained. My main concern with this section is that it reads like a technical report. I would expect an ontology description to cover
- the design principles, methodologies applied at creation,
- a comparison with other ontologies on the same topic.
One of the given examples states the goal to "grant write access to this picture gallery only to people I've met in real-life". Where is the "met in real life" information coming from?

In Section 5, the authors present a user study to determine the requirements for the design and implementation of the Privacy Preference Module. I like the general idea of the presented user study. But to understand and classify the results, some more information about the people surveyed is needed.
- How many women/men participated? What is their profession? If only computer scientists were surveyed, then the results for question one are to be expected, while this may not be true for other groups.
- How was the survey conducted? Questionnaires sent out by letter, online etc.?
- Were the seven percent who answered question 3 with "no" excluded from the rest of the analysis? If they have no interest in privacy settings at all, their opinion is not relevant for a privacy questionnaire.
- One concern I have is that the way the questions are formulated can affect the answers. "If provided by the system, would you different privacy settings" leads in my opinion to more positive answers.

In general, the idea presented in the paper is interesting and relevant to the topic of the journal. But, there are many aspects that need clarification or modification. Also, it is hard to say what exactly the novel contributions of the approach are.
Minor remarks:
- The paper changes between American and British English.
- The reference for TWARQL has no connection to TWARQL.
- There are a few empty sections (between 2 and 2.1, or 2.2 and 2.2.1).
- The system evaluation in section 7 should be extended. No conclusion can be drawn based on seven users.