Dynamic Discussion Topics Illustrator: a collective knowledge system for modeling social media topics evolution

Tracking #: 3223-4437

Iulia-Maria Radulescu
Alexandru Boicea
Mariana Mocanu
Florin Rădulescu
Daniel Călin Popeangă

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report

Abstract:
Social media's ever-growing popularity has led to the emergence of the Social Semantic Web, an assembly of collective knowledge systems. This class of frameworks, algorithms, and tools aims to retrieve, process, and represent knowledge from human contributions. In this paper, we introduce a collective knowledge system that models the transformation of online conversations over time, allowing stakeholders to easily observe trends and behavior patterns. Our framework relies on an original, graph-based algorithm, called Dynamic Discussion Topics Illustrator, that builds "semantic evolutionary maps" of user discussion topics, which we call Discussion Topic Flows. The Discussion Topic Flows result from matching comment clusters from sequential time windows according to their semantic similarity. The proposed system integrates the following phases: dataset preparation, text clustering, topic extraction, and, finally, the application of the Dynamic Discussion Topics Illustrator algorithm. We exemplify our method on a popular use case: automated extraction of user feedback from online software forums. For this purpose, we collect a real-world dataset of submissions posted on the dedicated Fedora subreddit, r/Fedora, over the entire year 2021. We evaluate the correctness of the results from three distinct perspectives: i) the quality of the comment clusters, assessed using three popular internal measures; ii) the structure of the Discussion Topic Flows, expressed by their length and events quantity; and iii) the explainability of the Discussion Topic Flows, measured through the coherence of their constituent topics.
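The cluster-matching step at the core of the abstract's description can be illustrated with a minimal sketch. The centroid representation, the cosine metric, and the 0.8 threshold below are illustrative assumptions, not the paper's actual design:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_clusters(prev_centroids, curr_centroids, threshold=0.8):
    """Link each cluster of the previous time window to every
    sufficiently similar cluster of the current window; the
    accumulated links form the edges of a topic-flow graph."""
    edges = []
    for i, p in enumerate(prev_centroids):
        for j, c in enumerate(curr_centroids):
            sim = cosine(p, c)
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges
```

Applied to every pair of consecutive windows, such links would trace how discussion topics persist, merge, or vanish over time, in the spirit of the Discussion Topic Flows described above.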

Solicited Reviews:
Review #1
By Luigi Asprino, submitted on 03/Oct/2022
Review Comment:

The paper presents Dynamic Discussion Topics Illustrator, a framework for extracting topics and their "temporal evolution" from social media comments.
The framework combines standard NLP technologies (WordNet Lemmatizer, Doc2vec, KMeans, LDA) to process the text and extract topics from clusters of comments; the Dynamic Discussion Topics Illustrator algorithm then extracts Discussion Topic Flows (DTFs) from the topic models. A DTF is a graph that models the evolution of social media conversations over time.
Finally, the authors apply the proposed framework to a collection of Reddit comments about Fedora Linux, aiming to identify the main reasons for user dissatisfaction.

While the output of the framework is very convincing and clear, I see major limitations that discourage acceptance.

My main concern about the paper is its relevance to the special issue. The framework relies mainly on NLP tools and, honestly, I don’t see any relation to the Semantic Web. As far as I understand, what the authors call “semantic maps” are networks whose nodes are topics (which are neither disambiguated nor aligned with existing resources) and whose edges are similarity relations among them.

I also found the paper hard to read. It describes the components of the framework without letting the reader understand the rationale behind the design choices. For example, clustering and topic modelling seem (in a certain sense) to conflict. Why do the comments need to be clustered before extracting the topic model? Extracting only one topic per cluster seems a bit strange to me. The resulting topic is a weighted list of words, which could, for example, have been extracted by computing the TF-IDF of the terms.
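The TF-IDF alternative suggested here could, for instance, look like the following sketch, which treats each cluster of comments as a single document and ranks its terms (the function name, tokenisation, and top-k cutoff are hypothetical, not taken from the paper):

```python
import math
from collections import Counter

def tfidf_topics(clusters, k=3):
    """Treat each cluster of comments as one document and rank its
    terms by TF-IDF; returns the top-k (word, weight) pairs per
    cluster as a cheap one-topic-per-cluster baseline."""
    n = len(clusters)
    doc_freq = Counter()          # in how many clusters each word occurs
    term_freqs = []
    for comments in clusters:
        tf = Counter(w for c in comments for w in c.split())
        term_freqs.append(tf)
        doc_freq.update(tf.keys())
    topics = []
    for tf in term_freqs:
        scored = {w: count * math.log(n / doc_freq[w])
                  for w, count in tf.items()}
        topics.append(sorted(scored.items(), key=lambda x: -x[1])[:k])
    return topics
```

A weighted word list produced this way would be directly comparable to the per-cluster topics the framework extracts with LDA, which is why an empirical comparison against such a baseline seems warranted.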

The relation with existing work is not clear. Section 2 reports on related approaches without clarifying how the proposed framework is connected to them.

Review #2
Anonymous, submitted on 08/Mar/2023
Review Comment:

The paper presents a tool-supported method for creating overviews of how online discussions evolve thematically over time. Preprocessed posts from each time interval are represented as low-dimensional vectors and clustered into interval clusters. Interval clusters are then merged within and across consecutive time intervals to form topic flows, which are visualised as alluvial diagrams. The tool has been used to analyse one year of 22k Reddit postings about the Fedora operating system. The created overview has been evaluated using internal measures of cluster quality, structure, and topic coherence.

Social media discussions already play important roles in almost all aspects of public life. Developing better methods and tools to understand their central topics and how they evolve over time is therefore a highly important research area. The paper is mostly well-written and structured and easy to follow. The technical implementation and evaluation appear carefully and thoroughly carried out. However, the manuscript in its present form also has several weaknesses.

- Fit for the Semantic Web Journal: Although the tool uses general graph representations and NL analysis of social-message semantics, it does not use, or attempt to contribute to, techniques or practices that are central to the Semantic Web, linked data, ontologies, or knowledge graphs. The Semantic Web is mentioned several times in the introduction, but never again after the first sentence of the background section. Instead, the tool's goal is to better understand social-media content, so the paper would more appropriately be directed to that research community.

- Lack of comparison with related work. Another weakness is the lack of comparison with other approaches to topic clustering. Although the background section mentions a few approaches with similar aims, they are not presented in any detail, and there is no attempt to empirically compare the paper's proposal with existing methods. Section 5.4, "Results discussion", does not contain a single reference to other work.

- Lack of evaluation with human users. This is another critical limitation of the present version of the paper. The evaluation uses only internal measures, and it is unclear how well these measures reflect the needs of human users. For example, the paper does not argue convincingly for the relevance of the "length and events quantity" measures used to evaluate topic flows. Empirical assessment of the resulting topic flows and their visualisations by human subjects is therefore called for before the paper can be accepted in a top-level journal.

- Presentation of the algorithms. The "Discussion Topic Flows" algorithm is described only briefly and informally in the main text. Readers are referred to a table (Alg. 2) with pseudocode that seems to leave some details out. For example, why is nodesStack treated as a stack and not just a list (it is only popped, never pushed to)? The algorithm comes across as under-explained; as a result, it is also not clear how original it is.

Other issues:

- Use of pre-processing in combination with doc2vec needs motivation. Le and Mikolov's original paper [15], which you cite, does not seem to use similar pre-processing (they even treat special characters such as ,.!? as normal words). The choice of preprocessing techniques therefore needs explaining.

- Use of dimensionality reduction after doc2vec. As you explain, doc2vec offers "a customizable number of dimensions". Why, then, is a separate step needed to further reduce the number of dimensions (instead of setting the desired dimensionality directly in doc2vec)? Also, the dimensionalities before and after reduction (15 and 3) appear very low and call for explanation.
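To make the question concrete: a separate reduction step could be any generic projection. The sketch below uses a Gaussian random projection purely as a stand-in (the paper's actual reduction technique is not specified in this review) to map 15-dimensional embeddings down to 3 dimensions:

```python
import random

def random_projection(vectors, out_dim, seed=0):
    """Reduce dimensionality with a Gaussian random projection:
    multiply each input vector by a fixed random in_dim x out_dim
    matrix. A generic stand-in for whatever reducer the paper uses."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    proj = [[rng.gauss(0, 1 / out_dim ** 0.5) for _ in range(out_dim)]
            for _ in range(in_dim)]
    return [[sum(v[i] * proj[i][j] for i in range(in_dim))
             for j in range(out_dim)]
            for v in vectors]
```

Whatever the concrete reducer, the point stands: since doc2vec can emit 3-dimensional vectors directly, the extra 15-to-3 step needs justification.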

- Use of pre-determined time intervals. The user is expected to provide the number and length of each time interval as inputs to the method. But these parameters might be better extracted from the data. Although this might go beyond the scope of the present paper, the possibility should be mentioned and your choice explained.

- The alluvial diagram in Figure 2 is already very complex with only 6 topic flows. Some of the relations seem to pass through nodes, and they can be hard to distinguish from the relations that actually connect nodes. This solution does not seem to scale well enough to be useful in realistic cases. The paper admits this but does not discuss mitigations or alternatives.

- The caption of Table 1 mentions both "unbalanced, and isolated" cases, but only one of them seems to appear in the table and text. The practical relevance of the "unbalanced" case needs more explanation.

Long-term stable URL for resources: The GitHub repository appears complete code-wise, but I could not find the Fedora subreddit dataset. The README.md is only three lines, including the paper title.