Social Influence Analysis in Microblogging Platforms - A Topic-Sensitive based Approach

A.E. Cano, S. Mazumdar, F. Ciravegna
The use of Social Media, particularly microblogging platforms such as Twitter, has proven to be an effective channel for promoting ideas to online audiences. In a world where information can bias public opinion, it is essential to analyse the propagation and influence of information in large-scale networks. Recent research studying social media data to rank users by topical relevance has largely focused on the "retweet", "following" and "mention" relations. In this paper we propose the use of semantic profiles for deriving influential users based on the retweet subgraph of the Twitter graph. We introduce a variation of the PageRank algorithm for analysing users' topical and entity influence based on the topical/entity relevance of a retweet relation. Experimental results show that our approach outperforms related algorithms including HITS, InDegree and Topic-Sensitive PageRank. We also introduce VisInfluence, a visualisation platform for presenting top influential users based on a topical query need.
Submission type: 
Full Paper

Revised submission after an "accept with major revisions" and a subsequent "accept with minor revisions", then accepted for publication. Reviews for round two are below, followed by the reviews of the first round.

Solicited review by Claudia Wagner:

The authors addressed most of my comments. The readability of the paper is a lot better now. Their work is very interesting and definitely relevant for this journal.
However, some things are still unclear for the reader from my point of view, and therefore I have a few further comments and suggestions for the authors:

- section 6, Evaluation:
It is unclear how the topic-sensitive PR was applied - e.g., with how many topics?
How did the authors choose them?

- section 6.2.
why is it good if an algorithm shows a high correlation with ID?
Only because wefollow uses it? I am not even sure how the authors know what wefollow
uses, since they do not disclose it - if they did somewhere, I would like to see a reference.

- Definition 3
this measure is hard to understand - the authors should explain what a Q value of 4 means
by giving an example.

-Table 4
Average values alone mean nothing. It would be good to show variances, or even to plot the Q values per topic and algorithm
with box plots. That would allow the reader not only to compare the means but also to assess the distribution of Q values.
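To make this suggestion concrete, here is a minimal sketch of reporting per-topic spread alongside the mean. The algorithm names and Q values are invented placeholders, not the paper's actual results:

```python
import statistics

# Hypothetical Q values per topic for two algorithms (illustrative only;
# not the paper's reported data).
q_values = {
    "TEPR": {"sports": [4.1, 3.8, 4.5], "politics": [2.9, 3.2, 3.0]},
    "TSPR": {"sports": [3.0, 2.7, 3.9], "politics": [2.8, 2.5, 3.1]},
}

# Report mean and variance per topic, so the reader can judge both the
# central tendency and the spread of Q values for each algorithm.
for algo, topics in q_values.items():
    for topic, qs in topics.items():
        mean = statistics.mean(qs)
        var = statistics.pvariance(qs)
        print(f"{algo} / {topic}: mean={mean:.2f}, variance={var:.2f}")
```

A box plot per topic and algorithm (e.g. via a plotting library) would convey the same information graphically, which is what the review asks for.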

-Def 3
there is a typo --- "a higher rank that"

- Conclusions:
the authors point out that they now explain in the conclusions the difference between TSPR and their algorithm and why they think it performs better.
For me as a reader this is still not entirely clear. If the authors want the readers to believe that their algorithm really outperforms
the state of the art, they should elaborate on the explanation a bit - maybe add a separate discussion of the results.

Solicited review by Guillaume Erétéo:

This paper is particularly relevant for this issue of the semantic web journal. I like the proposed approach and I appreciate that the authors have taken into account my previous comments.

First round reviews:

Solicited review by Sofia Angletou:

This paper investigates the very interesting and complex problem of identifying the most influential users on certain topics on Twitter. I think the contribution of the paper is substantial, it describes a large piece of work, and it should be included in the journal. The authors propose a methodology in which the tweets are broken down into various elements and semantically enriched using two well-known semantic platforms; triples are then generated, using open vocabularies, to describe the relations among users, tweets, entities, topics and time. These triples are then processed by influence calculation algorithms. The performance of the approach is evaluated against well-known influence detection algorithms. The authors also present an interface which visualises the temporal influence of a group of users on a selected topic, along with clouds of tags, links and entities.

I have a few criticisms and questions that I would like to be answered before the publication of the paper.

First, there are some language and terminological issues throughout the paper. There is a significant number of grammatical and spelling errors ("we propose the user of semantic", "a user has to main roles", "set users you followed", "Based on these elements we we", and many more). Second, the paper would benefit from a definitions section where concepts would be clarified. For example, in the introduction it is stated that "..which users become influential in a particular topic and in a particular entity. For example... audiences interested in tennis player..." In section 2 it is stated "... Between a user and the entities (e.g., person, products) mentioned on the content of his posts or retweets." It should be clarified what an entity is and how a user can be influential in (or about) a particular entity. At the beginning of section 2 you mention the concept of a semantic trail, which then becomes a semantic trace. This needs to be explained.

Why is there no mention in Section 3.1 of being followed as a writer, but only of being listed? The last paragraph of 3.1, about Twitter privacy, seems completely out of context there, and I don't understand how it contributes to this section.

Finally, regarding the interface: although impressive, it is not clear how its beta version will work online. All the analysis presented here is based on data collected from the streaming API. Does that mean that VisInfluence will perform this analysis on data collected from the API, which will be constantly streaming all tweets? Have you considered the scale issues imposed by such a solution?

Solicited review by Claudia Wagner:

The authors of this paper present an algorithm to identify influential Twitter users and compare it against existing algorithms such as HITS, PageRank and topical PageRank. They also present a platform to visualize influential users.

They evaluate their algorithm in a retweet-prediction task and assume that the most influential user is the one most likely to be retweeted.
From my point of view this evaluation has two main problems:
1) Previous research [1] has shown that retweets are motivated by many different factors. The assumption that the most influential user is the one who is most likely to be retweeted seems questionable.
2) the task of predicting whom to retweet seems strange in general, since one usually wants to suggest which tweet to retweet, or whom to retweet in which context (e.g. you might be interested in retweeting the Semantic Web-related tweets of this person).

In general, the paper addresses a very relevant problem but is quite hard to read and leaves too many questions open for the reader. For example:
- Do the authors aim to identify the users who will be retweeted by most others in the context of a given topic (global topical influence), or do they aim to predict whether a user A will be retweeted by a user B (personalized retweet recommendations)?
- Why and how do they think incorporating entity information should outperform topic information? Especially if they do not exploit the entity-type information?
- How does their approach differ from the topic-sensitive PR, and what can we learn from the fact that their approach outperforms the topic-sensitive PR?
I am wondering if one possible explanation is that Zemanta and Open Calais simply return more entity instances than topics. I think it would be good if they could share some insights on how the size of the bias vector and the type of bias vector influence their results.
Is it really important to have entity instances in the bias vector, or could simple keywords be used as well?

* In the abstract the authors state that they use semantic user profiles for deriving influential users based on the retweet subgraph of the Twitter graph. It would be good if the authors could explain at some point how they generate those profiles and what those profiles look like. Since the improvement of their algorithm seems to depend on these profiles, it seems crucial to understand what these profiles are.

* In the introduction the authors give the following example: a user can be influential in Sports News, but not reach audiences interested in tennis player Roger Federer.
What does this mean? Does it mean that tweets of this user about sports are likely to be retweeted unless they are about Roger Federer?

* Table 2 should also compare HITS - ID, HITS - TSPR and ID - TSPR. Otherwise it is hard for the reader to compare Kendall values. It could be the case that existing algorithms have a higher rank correlation when compared against each other than when compared against the new algorithm introduced in this paper.
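A sketch of the pairwise comparison this comment asks for: computing Kendall's tau between every pair of baseline rankings, not only against the new algorithm. The rankings below are invented placeholders, not the paper's data, and the implementation assumes untied ranks:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    rank_a and rank_b map each item to its rank position (no ties assumed).
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant if both rankings order x and y the same way.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical top-5 user rankings produced by three baselines (illustrative).
rankings = {
    "HITS": {"u1": 1, "u2": 2, "u3": 3, "u4": 4, "u5": 5},
    "ID":   {"u1": 2, "u2": 1, "u3": 3, "u4": 5, "u5": 4},
    "TSPR": {"u1": 5, "u2": 4, "u3": 3, "u4": 2, "u5": 1},
}

# Report tau for every pair of baselines, as the review suggests.
for (name_a, ra), (name_b, rb) in combinations(rankings.items(), 2):
    print(f"{name_a} vs {name_b}: tau={kendall_tau(ra, rb):+.2f}")
```

With such a table, the reader could see whether the baselines agree with each other more than they agree with the proposed algorithm.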

* The interpretation of the results shown in Table 2 is unclear. The authors point out that for
the first 5 topics their approach has a higher ranking agreement with the topical in-degree algorithm (ID). However, for the reader it is unclear what that means. Is that good, and if yes, why?

* Section 5.2: In this section the authors introduce and describe their new algorithm, the Topic-Entity PageRank. As far as I understood, the new thing about this algorithm is that it incorporates entities rather than only topics.
It looks like they use instances of entities without taking the entity-type information into account. If that is correct, then I wonder what the difference is between the topics they use and the entity instances.
It would be good if the authors could clarify that.

* I highly suggest that the authors improve section 6.3 and describe in textual form how they do the evaluation and what their evaluation metrics mean.
Definition 3 is hard to understand, and there are some spacing problems. It is unclear to me what one can see from Table 3.

* After Definition 3 the authors write: "Table 3 presents the average Q values over all the 18 topics." For the reader it is unclear which topics they are talking about at this point.

* Header of Table 3 seems to be inappropriate.

minor changes / typos:
*) abstract:
In this paper we propose the *user* of…
*) section2: encoding problems in the last paragraph
*) section 9: *Twiter*

[1] boyd, danah, Golder, Scott, and Lotan, Gilad (2010). Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. Proceedings of HICSS-43, Kauai, HI, January 5-8.

Solicited review by Guillaume Erétéo:

This is a nice paper that deals with a subject that is important for this special issue of the Semantic Web Journal on Microposts.

This paper proposes a novel method to measure the social influence of Twitter users in different entities and topics. The authors use Zemanta and Open Calais in order to semantically enrich tweets with the topics and the entities that are discussed. Then they propose an algorithm to determine the influence of users in the detected topics and entities, focusing on the retweet network.
This method was evaluated on a real Twitter dataset collected by the authors.

The results obtained with this algorithm are compared with other algorithms presented in the state of the art: HITS, in-degree and topic-sensitive PageRank. The algorithm is well formalized and nicely evaluated.

This paper should be accepted but with the following revisions.

First, you should give more details on the analysed dataset. In addition to the different power-law distributions, it would be useful to provide other graph metrics such as density, diameter, number/size of components, and average distance between users.

Then, the evaluation content is interesting, but the paper would benefit from developing it in more detail, with a deeper description of the results. In particular, I would like to know how you selected the topics used in Table 2 for the analysis of the correlations among the compared methods.

Finally, I am a little bit uncomfortable with the paragraph about VisInfluence. The description is pretty short, without any evaluation of the usefulness of the interface. It is interesting to discuss a concrete application of this algorithm, but in that case this section should be better introduced, with more cohesion with the presented algorithm, and it would be nice to develop the benefits of this tool.

The algorithm is the most important contribution of this paper and its evaluation should be prioritized. So, if you lack space, you should remove section 7 in order to develop the evaluation and the description of the analysed dataset.

"since measures for analysing three-mode graphs is still an area of research, it is common to study this type of graphs by taking 3 two-mode networks"
==> You describe your data in RDF, which can be seen as a rich typed graph model. SPARQL enables you to directly query this graph without the need for intermediate representations. Moreover, the expressivity of SPARQL 1.1, which is implemented in most semantic engines, enables you to compute complex graph measures on RDF graphs. So I think it could be relevant to delegate most graph operations to SPARQL and avoid using intermediate graph representations.
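As a minimal sketch of what such a direct computation could look like: a SPARQL 1.1 aggregate query counting per-user retweet in-degree on the RDF graph. The vocabulary here is an assumption (sioc:has_creator from SIOC plus a hypothetical ex:retweets property), not the paper's actual schema:

```sparql
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX ex:   <http://example.org/ns#>

# Per-user retweet in-degree, computed directly on the RDF graph
# with no intermediate two-mode projection.
SELECT ?author (COUNT(?retweet) AS ?inDegree)
WHERE {
  ?retweet ex:retweets ?post .
  ?post sioc:has_creator ?author .
}
GROUP BY ?author
ORDER BY DESC(?inDegree)
```

Measures such as density or component counts would need more involved queries or engine support, but simple degree-based measures are expressible with SPARQL 1.1 aggregates alone.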

minor details :
title of the header: "instruction for the preparation of a camera ready paper in LATEX" ==> replace it with yours :-)
page 2, beginning of section 3: "based on these elements we we describe" ==> "based on these elements we describe"
page 4, end of section 3.3: "(3(c))" ==> "(Figure 3(c))"
page 5, end of section 4.1: the spacing between lines is too large
page 6, section 4.2: "as a a qualified" ==> "as a qualified"