User profiling on Twitter

Paper Title: 
User profiling on Twitter
Edgar Rocha, Alexandre P. Francisco, Pável Calado, H. Sofia-Pinto
Social networking and microblogging integrating services, such as Twitter, have been gaining popularity in recent years. In this context, the study of user activity and information flow raises several interesting questions, with important real life implications, such as user influence prediction and information flow optimization. In this paper we study how to differentiate users given their activity. We focus just on user activity, ignoring the content of messages a user exchanged. Unlike previous work that focus on user activity and content of messages user exchanged, we take into consideration both social interactions and tweeting patterns, which allow us to profile users according to their activity patterns.
Submission type: 
Full Paper
Responsible editor: 
Guest Editors

Solicited review by Fabian Abel:

A. Summary of content of the paper:
The authors aim to classify Twitter users based on their activities in the Twitter network. The paper reports on a dataset crawled from Twitter, strategies for preparing/cleaning the dataset, and a first analysis in which the authors cluster the users (based on features such as number of followers, number of (re)tweets, or number of times a user was involved in certain types of re-tweet chains) and describe the clusters (and the users they find in them).

B. Summary of review:
The article/work is at an early stage and mainly reports on methods for preparing the dataset to facilitate future analysis. Some of those data preparation steps are necessary because of the crawling strategy used. Using other crawling strategies (e.g. via Twitter's streaming API) would avoid the necessity of those steps. Moreover, for the given purpose of user profiling, the crawling method is not very appropriate, as the dataset does not provide a continuous picture of the tweets published by a specific user. The analysis does not go beyond describing users that are assigned to (unlabeled) clusters based on their activities in Twitter's social network. The scientific value of the results is therefore limited.

The article is moreover not concerned with core Semantic Web topics. For example, the authors do not analyze the content of the tweets published by the users or try to sense the semantic meaning of Twitter messages. Hence, relevance to the journal and the special issue (Personal Social Semantic Web) is limited as well.

I am confident that the research that is presented in the paper will lead to exciting results that can be published at venues such as WWW, ICWSM, UMAP or related journals.

C. Detailed comments:
The summary of related work is very detailed but misses some interesting works that relate to user profiling. For example:

[1] Hecht, B., Hong, L., Suh, B., Chi, E.H.: Tweets from Justin Bieber's Heart: The Dynamics of the "Location" Field in User Profiles. In: Proceedings of the 29th International Conference on Human Factors in Computing Systems (CHI), New York, NY, USA, ACM (2011)

[2] Golbeck, J., Hansen, D.L.: Computing Political Preference among Twitter Followers. In: Proceedings of the 29th International Conference on Human Factors in Computing Systems (CHI), New York, NY, USA, ACM (2011)

[3] Fabian Abel, Qi Gao, Geert-Jan Houben, Ke Tao. Analyzing User Modeling on Twitter for Personalized News Recommendations. In Proceedings of International Conference on User Modeling, Adaptation and Personalization (UMAP), Girona, Spain, July 2011

The crawling method used is not optimal for the task the authors had in mind because it results in an incomplete picture of each user: the Twitter activities of a user are not continuously monitored, and, for example, the average number of tweets per user (after filtering) is very low at 2.87. Crawling in a snowball manner might be more appropriate. For example:
i. start with trending topics
ii. monitor users that are found in step i
iii. extend with users that interact with users found before

This could be implemented by using the Twitter streaming API. As the authors aim for user profiling it is important to continuously follow/monitor users so that one can classify the user behavior.
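The three snowball steps above can be sketched as a breadth-first expansion. This is only an illustration under stated assumptions: `snowball_sample`, the `interactions` mapping, and the `max_rounds` limit are hypothetical names, and the in-memory dictionary stands in for interactions that would in practice be observed from a live stream.

```python
from collections import deque


def snowball_sample(seed_users, interactions, max_rounds=2):
    """Expand a set of seed users by repeatedly adding users they interact with.

    seed_users:   users found via trending topics (step i), hypothetical input.
    interactions: dict mapping a user to the set of users they interact with
                  (hypothetical stand-in for data observed while monitoring).
    max_rounds:   how many expansion rounds (step iii) to perform.
    Returns a dict mapping each discovered user to the round of discovery.
    """
    discovered = {user: 0 for user in seed_users}
    queue = deque(seed_users)
    while queue:
        user = queue.popleft()
        depth = discovered[user]
        if depth >= max_rounds:
            continue  # do not expand beyond the configured number of rounds
        for neighbour in sorted(interactions.get(user, ())):
            if neighbour not in discovered:
                discovered[neighbour] = depth + 1
                queue.append(neighbour)
    return discovered


# Toy interaction graph: "a" is the only seed user.
example_interactions = {"a": {"b", "c"}, "b": {"d"}, "d": {"e"}}
example_sample = snowball_sample(["a"], example_interactions, max_rounds=2)
```

Every user in the returned sample would then be monitored continuously, which is the property the current dataset lacks.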

The identification of re-tweets raises some questions. For example, how accurate is this method? Why is the information delivered in the JSON representation of a tweet not used (Twitter indicates whether a tweet was re-tweeted)? Using information available in the JSON representation of a tweet one could accurately construct re-tweet chains.
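As a rough illustration of the metadata-based alternative, the sketch below groups retweets by the tweet they refer to. It assumes that each tweet arrives as one JSON object per line and that a retweet embeds the original tweet under a `retweeted_status` key, with `id` and `user.id` fields; these field names and the function `group_retweets` are assumptions for this example and should be checked against the API version actually crawled.

```python
import json
from collections import defaultdict


def group_retweets(json_lines):
    """Map each original tweet id to the list of user ids that retweeted it.

    Assumes (hypothetically) that a retweet carries the original tweet
    under the "retweeted_status" key; tweets without that key are treated
    as original posts and skipped.
    """
    retweeters = defaultdict(list)
    for line in json_lines:
        tweet = json.loads(line)
        original = tweet.get("retweeted_status")
        if original is not None:
            retweeters[original["id"]].append(tweet["user"]["id"])
    return dict(retweeters)


# Three toy tweets: one original post and two retweets of it.
sample_lines = [
    json.dumps({"id": 10, "user": {"id": 1}}),
    json.dumps({"id": 11, "user": {"id": 2}, "retweeted_status": {"id": 10}}),
    json.dumps({"id": 12, "user": {"id": 3}, "retweeted_status": {"id": 10}}),
]
sample_groups = group_retweets(sample_lines)
```

Under this assumption the metadata links a retweet only to the original tweet, so the result is a star-shaped group per original tweet; recovering intermediate hops of a chain would need additional information.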

Some of the tables and figures are not explained sufficiently and/or could be removed. For example, in Table 1 it would be good to add "XY et al." so that it is easier to understand the table. Fig. 3 and Fig. 4 do not add much value, I think. They could be removed or joined into one smaller figure. Table 3 should be discussed in more detail.

A more detailed description of the clustering method as well as further motivation on why the clustering is done would be nice. Are there any practical benefits of knowing that a user belongs to a certain cluster? (e.g. does it help for recommending tweets or predicting re-tweet behavior?)

The conclusions drawn from the analysis reflect that the work is at a very early stage. Knowing that there are mainly two clusters that together contain almost all users and allow for differentiating between active and non-active users is not a breakthrough. I think the approach of first trying to analyze what types of users can be found on Twitter is very promising. However, one would need to do the classification based on a complete/continuous picture of individual user activities. In order to better understand the different types of users, it might be useful to also plot diagrams that show distributions without clustering the users, e.g. x-axis = users of Cluster X, y-axis = frequency-based feature of user X. In a next step one could start to measure correlations to figure out whether, for example, users who have many followers participate significantly more often in certain types of re-tweet chains.
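The correlation step suggested above could start from something as simple as a Pearson coefficient between two per-user features. The sketch below is a minimal illustration; the function name `pearson` and the toy feature vectors are hypothetical and not taken from the paper's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user features: follower count and number of
# retweet chains of a given type the user took part in.
followers = [10, 200, 3000, 45000]
chain_participations = [0, 2, 5, 9]
r = pearson(followers, chain_participations)
```

Because follower counts are typically heavily skewed, a rank-based coefficient or a log transform of the counts may be a more robust choice than the raw Pearson value; a significance test would still be needed before drawing conclusions.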

D. Minor comments:
- Abstract: work that focus -> work that focuses

- introduction: side note: main news agencies actually know quite early what will dominate the news in the future (e.g. based on schedules of politicians, planned music/sport events, etc.) -> I would argue that "most of the time even before news media take notice of them" is not true

- introduction: what is "tweet nature"?

- related work: "To the best of our knowledge, none of previous research work on user profiling in Twitter explores dynamic pattern features." -> we [4] actually did a study on this which was motivated after reading [5].

[4] Fabian Abel, Qi Gao, Geert-Jan Houben, Ke Tao. Analyzing Temporal Dynamics in Twitter Profiles for Personalized Recommendations in the Social Web. In Proceedings of ACM International Conference on Web Science (WebSci), Koblenz, Germany, June 2011

[5] Huang, J., Thornton, K.M., Efthimiadis, E.N.: Conversational tagging in twitter. In: HT '10: Proceedings of the 21st ACM conference on Hypertext and hypermedia, New York, NY, USA, ACM (2010) 173-178

- followers and friends -> would it be better to use the terms followers and follows?

- it might be good to name the algorithms/methods (in Section 3), e.g. using some listing/algorithm environment

- Sec. 3.6: subscribe his/her -> subscribe to his/her

- Sec. 3.6, last bullet item, last sentence: "…chain with tree shape." -> "…chain with star shape."

- Sec. 3.6: user which main objective -> user whose main objective

- Sec. 4.1: in next section. -> in the next section.

Solicited review by Sergej Sizov:

The contribution presents results of a large-scale Twitter analysis. The key idea of this investigation is the categorization of users with respect to their activity profile, characterized by frequency and type of interactions, without consideration of actual message content. In doing so, the authors present the methodology of data acquisition and regularization, as well as an interpretation of the user clustering. In summary, the paper concludes that Twitter users can be categorized into five major groups with different cardinalities (active users, lurkers, users with broad interests, active retweeters, network hubs).

The contribution is well structured and clearly written. However, it appears to be of rather low relevance for the Special Issue on Personal Social SW. In fact, semantics (and any sort of latent semantics) plays a secondary role in the discussion. The core concern is applying state-of-the-art data mining methods (that do not go beyond the state of the art, however) to regularized Twitter data, in combination with a straightforward data engineering methodology. Despite the practical importance of the presented investigation, its scientific novelty and originality (especially in the field of SW) appear limited.

The methodology of data modelling and data analysis appears clearly explained. However, some conceptual questions remain open. In particular, the authors restrict themselves to the analysis of user interactions and disregard the content (and thus also the context) in which the interaction took place. In this connection, the following important question remains open: to what extent are user interactions useful for user characterization, compared with content-centric and combined models? Moreover, it can be assumed that the choice of the partitioning algorithm and data model influences the resulting clusters as well, so a more systematic comparison with other data models (e.g. based on feature subsets) and with alternative partitioning methods (e.g. principal component analysis) can be recommended.

In summary, I do not recommend accepting this contribution for the Special Issue on Personal Social SW in its present form. However, I would like to encourage the authors to extend the contribution and to submit it to a more relevant journal related to the fields of Data Mining and Social Web (e.g. ACM TWEB or similar).

Solicited review by Qi Gao:

This paper presents methods to characterize users' behaviour on Twitter. To approach this goal, the authors propose algorithms to construct retweet and network chains to investigate information flow and user activities in Twitter. Given a set of features, the standard EM clustering is adopted to generate subgroups of users to differentiate user activity patterns.
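To make the clustering step concrete, the following is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture. It is not the authors' setup: the paper clusters users on several features, whereas this example uses a single hypothetical feature, a fixed number of two clusters, and illustrative names (`em_two_gaussians`, `var_floor`) chosen for this sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(values, iterations=50, var_floor=1e-3):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (means, variances, weights, responsibilities), where
    responsibilities[i][k] is the soft assignment of values[i] to cluster k.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = max(sum((v - overall_mean) ** 2 for v in values) / n, var_floor)
    means = [min(values), max(values)]  # simple deterministic initialisation
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in values]
    for _ in range(iterations):
        # E-step: soft-assign every value to both components.
        for i, v in enumerate(values):
            scores = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(scores)
            resp[i] = [s / total for s in scores] if total > 0 else [0.5, 0.5]
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            if mass == 0:
                continue  # leave an empty component unchanged
            weights[k] = mass / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / mass
            spread = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / mass
            variances[k] = max(spread, var_floor)  # floor avoids collapse to zero
    return means, variances, weights, resp


# Hypothetical activity feature with a low-activity and a high-activity group.
activity = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
fit_means, fit_vars, fit_weights, fit_resp = em_two_gaussians(activity)
```

The soft assignments in `fit_resp` are what distinguishes EM from hard partitioning; evaluating cluster quality, as requested in this review, would require a separate measure on top of them.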

Overall the paper is well structured with a clear motivation. The results of the analysis provide interesting insights for interpreting the different user clusters; however, the quality (e.g. wrt accuracy) and benefits of the clustering are not evaluated. My first major concern is that semantic techniques do not play an important role in this paper. The retweet chain construction algorithm and the selection of features are mainly based on social interaction, network properties and user attributes. The content analysis does not really consider the semantics of tweets/retweets. It thus seems that the paper is thematically out of scope for the SW journal. Another worry concerns the novelty of the contribution, which lies narrowly in the creation of retweet and network chains. The authors did not dive into the user profiling part to show, for example, how the clustering can leverage user profiling or how it is beneficial in an application context (e.g. personalization).

Comments per section:
Title: capitalize "P" in "profiling"

- first paragraph: 'news media take' -> 'news media takes'
- 'and or initially' -> 'and/or initially'
- 'twitting' -> 'tweeting'.

- I recommend that the authors shorten the related work a little bit (now it is almost 2 pages).
- The authors could add more references on the benefits of 'good' user profiling on Twitter in different application contexts (e.g. trend prediction, personalization), such as the paper: Analyzing User Modeling on Twitter for Personalized News Recommendations. Fabian Abel, Qi Gao, Geert-Jan Houben, Ke Tao. UMAP 2011.

Section 3:
- The two columns of average tweets/retweets are somewhat redundant to me. The authors could replace them with the maximum/minimum number of tweets/retweets in one day, which would give some impression of whether those topics are rather short-term trends or constantly popular during the observation period.
- The size of table 2 could be smaller. Or this table could be replaced by a figure.
- I am not sure about the efficiency of the way the retweet chains are constructed. As far as I know, a straightforward but efficient way to create a retweet chain is based on the 'retweetFromPostID'/'retweetFromUser' attributes in the response of the Twitter API as in: Although this will miss some retweets done manually, which I think will not be too many, it is not clear or convincing to me why a more complex algorithm should be used instead of that simple one, especially since this retweet chain construction algorithm is considered one important contribution of this paper. The authors may need to clarify this in the paper, for example by showing the efficiency and accuracy of the retweet chain construction algorithm and, furthermore, by evaluating both approaches (the simplified one using the Twitter API and the one proposed in this paper) in the clustering experiments to see how they differ.
- §3.6: 'defended' -> 'defined'
- The authors should use the term 'classification' carefully, since only a clustering experiment was conducted.
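One way to report the accuracy requested in the retweet chain comment above is to compare the retweet links found by the proposed algorithm against links derived from API metadata, treating the latter as a reference. The sketch below is a hedged illustration: `precision_recall` and the toy link sets are hypothetical, and the metadata-based reference is itself imperfect because it misses manual retweets.

```python
def precision_recall(predicted_links, reference_links):
    """Precision and recall of predicted (retweeter, original_tweet) pairs.

    predicted_links: pairs produced by the algorithm under evaluation.
    reference_links: pairs taken as ground truth (here assumed to come
                     from API metadata, which is only an approximation).
    """
    predicted = set(predicted_links)
    reference = set(reference_links)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy example: pairs are (retweeting user id, original tweet id).
heuristic_links = {(1, 10), (2, 10), (3, 11)}
metadata_links = {(1, 10), (2, 10), (4, 12)}
link_precision, link_recall = precision_recall(heuristic_links, metadata_links)
```

Links that appear only in the heuristic output are not necessarily errors, since they may be manual retweets; inspecting a sample of them by hand would show how much the more complex algorithm actually adds.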

Section 4:
- The evaluation is solid in terms of the clustering setting and the interpretation of the clusters; however, I expected the evaluation to demonstrate the benefits of the approach for user profiling, i.e. are those features sufficient and effective to describe users in terms of their activities? A real user classification experiment similar to the one in reference [11] could be an option.
- Moreover, although cluster 2 is the largest user group, the other ones are more interesting and worth investigating further from the perspective of either research or market value. For example, the authors could evaluate which features make a tweet more likely to be retweeted. This could be addressed using user attributes such as the number of followers and, more importantly, content features (e.g. the style of the tweet and the substance of the content, into which the semantics would be woven).