Making Sense of Social Media Streams through Semantics: a Survey

Kalina Bontcheva, Dominic Rout
Using semantic technologies for mining and intelligent information access to social media is a challenging, emerging research area. Traditional search methods are no longer able to address the more complex information seeking behaviour in media streams, which has evolved towards sense making, learning, investigation, and social search. Unlike carefully authored news text and longer web context, social media streams pose a number of new challenges, due to their large-scale, short, noisy, context-dependent, and dynamic nature. This paper defines five key research questions in this new application area, examined through a survey of state-of-the-art approaches to mining semantics from social media streams; user, network, and behaviour modelling; and intelligent, semantic-based information access. The survey includes key methods not just from the Semantic Web research field, but also from the related areas of natural language processing and user modelling. In conclusion, key outstanding challenges are discussed and new directions for research are proposed.
Submission type: Survey Article

Resubmission after "accept with minor revisions", now accepted. First round reviews are below.

Review 1 by Harald Sack

The authors provide a survey on state-of-the-art data mining of social media streams based on semantic technologies, natural language processing and user modelling.
The survey is organized according to 5 key questions:
(1) Ontologies and Web of Data resources for representing and reasoning about the semantics of social media streams
(2) Semantic Annotation to capture implicit semantics of social media streams
(3) Extraction of reliable information from social media streams
(4) User modelling for social media streams
(5) Semantic-based information access to social media streams

After a concise introduction, section (2) provides a categorization of social media resources, followed by the identification of key social media sites, and the challenges of social media resources concerning reliable information extraction. Section (3) enumerates and introduces various ontologies used for representing social media resources as well as for modelling user behaviour.

For a survey, I would prefer a more structured representation of the mentioned ontologies instead of a simple enumeration (at least in the same way as the representation of the key challenges in section 2).

Section (4) provides an overview of semantic annotation of social media resources as well as various mining approaches, ranging from keyword extraction and ontology-based and Wikipedia-based entity recognition, through event detection, to sentiment detection, opinion mining, and cross-media linking. The section ends with an in-depth discussion that could also be better structured for readability. Overall, the structure of the entire section is straightforward, with each part building on the previous one in a logical order. The challenges and limits of applying these mining techniques to social media streams and microposts are worked out well. More care could have been taken in describing and critically discussing crowdsourcing-based methodologies.

Section (5) is focussed on semantic-based user modelling and how user interests can be derived from semantic annotations. User demographics can be analyzed via location data. Here, the time dependency of user interests and their development over time is also an important subject to investigate. The discussion in this section, including the merging of heterogeneously created user models as well as distinguishing personal from global interests, could be more detailed.

Section (6) covers information access to social media streams, starting with semantic search and information filtering from various streams, followed by different means of visualization.

Section (7) summarizes the current challenges in social media mining such as cross-media aggregation, (web-scale) scalability, and the demand for standardized evaluation. This is one of the most important parts of the paper and could be worked out in more detail.

Overall, this is a very valuable compilation of the current state of the art in semantic-based social media stream analysis. In general, for better understanding and readability, the individual topics could be supported by tables and diagrams to visualize the complex structure and relationships presented in the paper. Discussions could be summarized by tables showing the pros and cons of different tools or approaches. Bibliographical references for evaluation based on crowdsourcing by non-specialists are missing.

Minor issues:
p1. ...and social search [?] - missing reference
p5. The acronym 'SKOS' is used without further explanation or reference

Review 2 by Ashutosh Jadhav

The authors have presented a comparative review of how semantic web (and natural language processing) techniques are used to address some of the research challenges associated with social data. Specifically, the authors focused on challenges in social media and conducted a detailed survey of current work in semantic web and social media, including the use of ontologies, semantic annotations, user modeling and information access. In the context of addressing social media challenges, the authors also cover recent work on natural language processing and visualization techniques.

Good points:
1. Although the scope of the paper's topic is vast, the authors have managed to touch upon the majority of the important research problems and related work. The paper provides a summary of various challenges, possible approaches, and their suitability, performance etc.

2. The authors have reviewed and cited most of the recent work in the respective sections.

3. This is the first comprehensive survey on the use of semantic web coupled with natural language processing techniques to address a variety of issues with social data.

4. Most of the sections are well written, providing a summary of recent work, a comparative study, and the authors' viewpoints on the challenges and possible solutions.

Weak points:

1. The organization of the paper can be improved. For example, the authors have discussed too many topics under the semantic annotation section. Can a taxonomy/classification or an organizational structure be provided to make it easier for a reader to differentiate the issues? Please have a look at this reference for organization

2. Most of the topics discussed by the authors in this paper are important and provide a comprehensive understanding of state-of-the-art work in social media. At the same time, the authors cover some topics (for example sentiment analysis) that depend heavily on natural language techniques and use little or no semantics at all. Since the topic of this paper is the use of semantic web in addressing social media challenges, the authors could have limited their scope to topics that use semantics to a certain extent to solve the problems. Alternatively, the authors can discuss how the use of dictionaries (e.g. Urban Dictionary), machine learning or background knowledge could be applied to this topic (e.g. ICWSM2012 has a paper on topic-specific sentiment, where identification of the topics with which sentiment is associated utilizes light-weight semantics).

3. Some of the sections are not written well, for example:

a) Section 2.1, Key Social Media Sites, can be improved in terms of flow, summarization of the various types of networks, their characteristics etc.
b) Section 4.1.1, Global topics, can be improved in terms of clarity

4. One of the major sections in the paper is Semantic annotation. The authors have summarized the what and how of semantic annotation but haven't discussed much about why semantic annotations are useful.

Minor points

1. A reference is missing on page 1 ([?])

2. The paper has some grammatical mistakes, such as "These graph-based approaches to extracting keywords from Twitter..."

Review 3 by anonymous reviewer

The paper presents an excellent survey of the state-of-the-art approaches and technologies for mining and intelligent analysis of user-generated content in social media. To the knowledge of this reviewer, this work is one of a kind and presents the most comprehensive up-to-date analysis of the literature in this emerging field of research. The paper is fun to read. It is well written, easy to follow and summarizes the state of the art very nicely. There are just a couple of comments regarding how to improve the paper.
Page 15: The authors point out the lack of lexical knowledge for processing user-generated content. Besides Wikipedia which has by now become an established resource in text analysis, Wiktionary has been found to be very valuable for these purposes [1,2]. Its particular advantage over standard lexical semantic resources is the inclusion of the terms specific to the user-generated content on the Web. Thus, it might be effectively utilized for analyzing social media content, and there are some ready-to-use tools available for that already:
Another recent trend in the language resources community is linking and merging large-scale resources to increase their coverage and make them more useful in broad-coverage text analysis. This is a link to the corresponding recent workshop: A particular example of a linked lexical-semantic resource is UBY, from the same group as the Wiktionary resource mentioned above [3,4].
The authors mention crowdsourcing as a possible way to improve the performance of automatic systems. This topic has received a lot of attention from different communities in recent years. Human computation, collective intelligence and games with a purpose can thus be discussed in greater detail. There are numerous references for that, which could make a very nice separate section in the context of this article, for example [5].
Page 26: Multilinguality is mentioned as one of the major challenges, with most of the methods being developed for English content only. This reviewer would like to see more discussion of what has been done for other languages and how the problem can be tackled. Which technologies need to be researched intensively to address this issue?

[1] Christian M. Meyer and Iryna Gurevych. Wiktionary: a new rival for expert-built lexicons? Exploring the possibilities of collaborative lexicography. In: Sylviane Granger and Magali Paquot: Electronic Lexicography, pp. (to appear), Oxford: Oxford University Press, 2012.
[2] Christian M. Meyer and Iryna Gurevych. OntoWiktionary – Constructing an Ontology from the Collaborative Online Dictionary Wiktionary. In: M.T. Pazienza & A. Stellato (Eds.): Semi-Automatic Ontology Development: Processes and Resources, Hershey, PA: IGI Global, 2011, pp. 131-161.
[4] Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth. Uby - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pp. 580-590, April 2012.
