Identifying Topics from Micropost Collections using Linked Open Data

Tracking #: 1807-3020

Ahmet Yildirim
Suzan Uskudarli

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Full Paper
The extensive use of social media for sharing and obtaining information has resulted in the development of topic detection models to facilitate the comprehension of the overwhelming amount of short and distributed posts. Probabilistic topic models, such as Latent Dirichlet allocation, represent topics as sets of terms that are useful for many automated processes. However, the determination of what a topic is about is left as a further task. Alternatively, techniques that produce summaries are human comprehensible, but less suitable for automated processing. This work proposes an approach that utilizes Linked Open Data (LOD) resources to extract semantically represented topics from collections of microposts. The proposed approach utilizes entity linking to identify the elements of topics from microposts. The elements are related through co-occurrence graphs, which are processed to yield topics. The topics are represented using an ontology that is introduced for this purpose. A prototype of the approach is used to identify topics from 11 datasets consisting of more than one million posts collected from Twitter during various events, such as the 2016 US election debates and the death of Carrie Fisher. The characteristics of the approach and more than 5 thousand generated topics are described in detail. A human evaluation of topics from 30 randomly selected intervals resulted in a precision of 81.0% and F1 score of 93.3%. Furthermore, the topics are compared with topics generated from the same datasets using two different kinds of topic models. The potential of semantic topics in revealing information that is not otherwise easily observable is demonstrated with semantic queries of various complexities.

Major Revision

Solicited Reviews:
Review #1
By Carlos Badenes-Olmedo submitted on 19/Feb/2018
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

In the field of social networks, especially those aimed at publishing short and informal comments such as Twitter, an unsupervised system called S-BOUN-TI is proposed for the identification of relevant topics and their description by the Topico ontology. In this domain, a topic is considered as a collection of entities (e.g. people, places, organizations) and temporal expressions. The idea behind the proposal is that entities that appear together in different publications can describe a topic, and eventually consolidate it if the number of publications is high enough.

# Overall Comments
The paper is well written and structured in all its parts.

The main challenge it addresses is not the identification of topics from short texts, but the description of these topics through an ontology that facilitates their comprehension (i.e. semantic topics). In this case, an approach based on Linked Data is chosen which delegates the initial identification of entities to an external tool. How these entities are processed to create groups that will eventually become topics, and their representation through an ontology, as well as the ontology itself, are the main contributions of this paper.

The evaluations, as well as their results, both partial and final, are described extensively.

# Strengths
The availability of all the resources used during the experiment is greatly appreciated (although the SPARQL endpoint was temporarily down).

[topic identification]
The way in which entities and temporal expressions are grouped together using external semantic resources (e.g. DBpedia, Wikidata) and graph theory is interesting. If I have understood correctly, the algorithm follows a density-based approach to identify groups of entities that ultimately suggest topics. The strength of the links lies in the joint appearance of the entities in the same text.
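As a rough illustration of this density-based grouping (not the authors' actual implementation; the co-occurrence threshold and the sample posts below are assumptions), a co-occurrence graph can be built from per-post entity sets and the maximal cliques enumerated with the Bron-Kerbosch algorithm:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_graph(posts, min_weight=2):
    """Build an adjacency map from entity co-occurrence within posts."""
    counts = Counter()
    for entities in posts:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    graph = {}
    for (a, b), w in counts.items():
        if w >= min_weight:  # prune weak links (threshold is an assumption)
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques with the Bron-Kerbosch algorithm."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p.remove(v)
            x.add(v)
    expand(set(), set(graph), set())
    return cliques

# Hypothetical per-post entity sets produced by an entity linker.
posts = [["Trump", "Clinton", "Debate"],
         ["Trump", "Clinton", "Debate"],
         ["Trump", "Clinton"],
         ["Carrie Fisher"]]
topics = maximal_cliques(cooccurrence_graph(posts))
# On this toy input, the only maximal clique is {"Clinton", "Debate", "Trump"}
```

Each resulting clique is a candidate topic; the singleton "Carrie Fisher" never co-occurs, so it never enters the graph.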

# Weaknesses
[topic vs event]
The work is based on the consideration of a topic as a collection of entities and temporal expressions. In this domain, it creates an ontology that allows tweets to be annotated. However, it is not clear enough what the difference would be between what is considered a topic and what would be an event associated with a news story, which could also be described as a set of entities at an instant in time. A greater justification of the approach followed to describe a topic would be appreciated, as it would reinforce all the work done later along this line.

[topico ontology]
Taking into account the previous aspect, there is a lack of a review of existing ontologies that describe entities linked at the same instant of time, without necessarily being called topics.

A good opportunity to manage polysemy is missed by reducing the relationship between surface forms (spots) and entities to a one-to-one mapping (Section 4.2, 1st paragraph). Using the most frequent relationship may be sufficient in some cases, but in others it can introduce noise.

[topic identification]
- It would be advisable to describe the reasons for considering whether or not a clique belongs to a topic. (Section 4.4, 3rd paragraph): "However, we manually inspected cliques .."
- Since the property 'isAbout' is perhaps the most relevant property of a topic, more information about its construction would be appreciated.

- When the proposal described in the paper is compared with an LDA-based one, a string-based measure is established as the distance metric. This metric compares the entities and temporal expressions that make up the topics discovered by S-BOUN-TI with the most relevant terms of the probabilistic topics generated by LDA. The first approach uses an external knowledge base (e.g. DBpedia, Wikidata) to disambiguate and normalize the entities. However, no pre- or post-processing of the text is described when training the LDA model that would allow comparing strings.
- There is a lack of tests with different LDA hyperparameter configurations, since the generated models depend to a large extent on their values. (Section 6.6.2)
- Why was Jaccard Similarity used and not other alternatives? (Section 6.6.2)
- It would be advisable to consider all the topics in the evaluation, not only those with a higher number of topic elements, to see how the algorithm behaves in those cases.
- The number of evaluators involved in the evaluation should be included to support the consensus values shown. (Section 6.3, 5th paragraph)
- Since the identification of entities is so critical to the final behavior of the algorithm, a comparison with other alternatives to TagMe would be welcome (Section 6.5.1).
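For concreteness, the string-based comparison and the Jaccard similarity questioned in the points above might look like the following sketch; the topic contents, the matching rule, and the resulting numbers are illustrative assumptions, since the paper's exact procedure and thresholds are what these points ask to clarify:

```python
def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical topics: S-BOUN-TI entity labels vs. top LDA terms.
sbounti_topic = {"donald trump", "hillary clinton", "presidential debate"}
lda_topic = {"trump", "clinton", "debate", "tax", "email"}

# A plausible reading of "substring matching": an LDA term matches
# if it occurs as a substring of some normalized entity label.
matched = {t for t in lda_topic
           if any(t in label for label in sbounti_topic)}
overlap = len(matched) / len(lda_topic)  # 3 of 5 terms match -> 0.6
```

Without normalization of the LDA vocabulary, surface variants ("trumps", "hilary") would silently fail such a match, which is why the missing pre/post-processing matters.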

- Section 1, 5th paragraph: "topicss" -> "topics"
- Section 3, 1st paragraph: The term topic is used to define the topic itself.
- Section 3, end of 1st paragraph: details about topic generation are covered in Section 4, not 5.
- Section 3.2, end of 2nd paragraph: whitespace is missing in "Time ontologydoes not cover".
- Section 4, 1st paragraph: It's not clear what the next sentence is:"In general, there is a collective alignment regarding relatively few elements within posts, however those elements occur with numerous other elements"
- Section 5.1, 4th paragraph: "micrblog" -> "microblog"
- Figure 6: "Idenitfy Topics" -> "Identify Topics"
- Section 6.2, 7th paragraph: "their their" -> "their"
- Section 6.5.1, 1st paragraph: missing "them" in -> ".. at least one and 36% of (them) have .."
- Section 6.5.2, 2nd paragraph: "topic elemets" -> "topic elements"
- Section 6.6, 1st paragraph: comma between "S-BOUN-TI" and "LDA"
- Section 6.6.1, 2nd paragraph: "ten topmost" -> "top ten most"

[missing references]
- Section 2.1, end of 1st paragraph: references to works in entity detection and linking from microposts.
- Section 2.4, 1st paragraph: resources used to express topics.
- Section 4.4, 1st paragraph: references about the maximal cliques algorithm.
- Section 5.1.2, 1st paragraph: list of rules used to identify seasons and month names.
- Section 6.6.2: "LDA is known to favor lower values for N"

# Suggestions
When presenting S-BOUN-TI as an alternative to LDA- and LSA-based approaches for identifying topics from unstructured texts, it would be advisable to do so from a more general point of view. For instance, by first presenting the solutions based on probabilistic approaches (e.g. LDA), then those based on matrix factorization (e.g. LSA), and finally those that follow a density-based approach, where S-BOUN-TI could fit.

[topic as documents]
Since a topic is said (Section 2.2, 2nd paragraph) to be considered as a document, could an ontology describing documents be used to describe the topics?

Include a table showing the results of the BOUN-TI algorithm (Section 6.6.1).

# Conclusion
Based on the above, I would recommend a minor revision to better justify the definition of a topic as an aggregation of entities, and to justify the creation of an ontology different from those already existing for describing events associated with news.

Review #2
By Gengchen Mai submitted on 26/Feb/2018
Major Revision
Review Comment:

The paper proposes a topic detection model named S-BOUN-TI for microposts, which utilizes entity linking and maximal clique methods to extract semantically represented topics from microposts. An ontology, Topico, is designed to represent these extracted topics in Linked Data. Topic elements, including persons, locations, related terms, and time durations, are extracted for each topic. The resulting structured topic data are published through Fuseki.

I list my comments below.

1. Many topic modeling techniques have been developed in recent years. One popular technique is LDA. However, it is always challenging to interpret the semantics of each topic extracted from an LDA model. This paper proposes a topic detection approach that can formally express the semantics of each topic using Linked Data. The direction of this work is very promising.

2. In Section 6.4, SPARQL query examples are shown, and it is discussed how to utilize the extracted semantically represented topics. Those examples, which combine the knowledge graph of extracted topics with other knowledge graphs, show the advantages of serializing the topics in Linked Data.

1. The paper is too long but some important information is missing in the content:
a. There are a lot of hyperparameters in S-BOUN-TI, but these parameters are scattered all over the paper. A section that lists and formally discusses all these parameters is necessary. All these hyperparameters are set by the authors arbitrarily. What is the effect of these parameters on the final result? A section discussing this is very important.
b. In the evaluation of S-BOUN-TI, this paper compares LDA with S-BOUN-TI by using substring matching. Since the topic elements of LDA are terms while the topic elements of S-BOUN-TI are entities, I assume "substring matching" means matching the entities' labels with the terms. But the topic elements of S-BOUN-TI are of different types, such as persons, locations, related terms, and times. Should elements of different types be treated equally or differently? Which threshold do you choose for the "substring matching" to determine that elements are sufficiently "matched"? An example would help the reader understand this step.

2. If I count correctly, there are 7 hyperparameters in S-BOUN-TI, while LDA has only one. Many parameters make the model difficult to use and tune. I wonder whether it is possible to eliminate some of these parameters.

3. In Fig. 7, an example of an extracted topic is shown. But the convention the authors use to name each topic entity's IRI is not given. For example, as for topico:2016_first_presidential_debate_46_48_topic_19, I can understand that the name is made by concatenating the dataset's name and the time interval. But I do not understand what "topic_19" means. In Section 6.2, the authors wrote: "each topic corresponds to a two-minute interval within a dataset (approximately 5,800 tweets)." So a time interval in one dataset should correspond to one topic. What does 19 mean in this case?

4. Regarding the same sentence, "each topic corresponds to a two-minute interval within a dataset (approximately 5,800 tweets)," I do not understand why the authors arbitrarily chose two minutes as the time interval. What's more, the phrase "corresponds to" is confusing to me. Does it mean the time interval is a topic (in which case we would not need a topic detection model)? Or does it mean all the tweets within this time interval form a "document", as an analogy to the document in LDA?

5. For a topic detection model, it is very important to point out what a "document" is, what a "word" is, and what a "topic" is, as an analogy to LDA. But the authors fail to point this out, which makes it difficult for me to fully understand the model. When I first read this paper, I assumed a "document" was a tweet. But only a few entities can be detected from one tweet, which would make it impossible to run the maximal clique algorithm. So, from my understanding, all tweets within a two-minute time interval in a dataset are treated as a "document." (Please correct me if I am wrong.)

6. A weighted co-occurrence graph is built and used to obtain topics. But it is not clear whether the authors construct co-occurrence graphs separately for each dataset, or for each time interval, or construct a global co-occurrence graph for all 11 datasets.

7. In LDA, each document can be represented as a real value vector. Each dimension is a topic. But for S-BOUN-TI, firstly, the definition of “documents” is not clear. Secondly, the detected topics do not share the same semantics among different documents which will make document classification impossible.

8. The table in Appendix D is messed up in the PDF version.

9. The candidate-element improvement technique in Section 4.2 is very problematic. It is normal for each spot to link to different entities. Washington State and Washington, D.C. might both be referred to as "Washington". Linking each spot to the entity it is most often linked to will corrupt many originally correct links. What's more, the authors do not give an evaluation of this technique.
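To make the concern above concrete, here is a sketch of the most-frequent-entity mapping being criticized; the (spot, entity) links are hypothetical, not taken from the paper:

```python
from collections import Counter

# Hypothetical (spot, entity) links produced by an entity linker.
links = [
    ("Washington", "Washington,_D.C."),
    ("Washington", "Washington,_D.C."),
    ("Washington", "Washington_(state)"),
    ("Washington", "George_Washington"),
]

# The technique under critique: map every spot to its single most
# frequent entity, discarding the linker's per-occurrence decisions.
most_frequent = {}
for spot in {s for s, _ in links}:
    counts = Counter(e for s, e in links if s == spot)
    most_frequent[spot] = counts.most_common(1)[0][0]

# Every "Washington" now collapses to Washington,_D.C., corrupting the
# two originally correct links to the state and the president.
corrupted = sum(1 for s, e in links if most_frequent[s] != e)
```

Even on this tiny example, half of the correct links are overwritten, which is why an evaluation of this step would be needed.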

Review #3
Anonymous submitted on 04/May/2018
Major Revision
Review Comment:

The paper presents a methodology (S-BOUN-TI) for detecting topics from a collection of microposts by using Linked Open Data. This approach tags the microposts with relevant DBpedia concepts (using the TagMe tagger), builds their co-occurrence network, and runs a community detection algorithm to obtain a set of clusters that should represent the pertinent topics. The paper also introduces the Topico ontology for representing topics in RDF.

The paper is well written and clear. The topic is significant and of potential impact.

The approach is not particularly novel, but it appears to be sound. The Topico ontology is interesting, but it seems to be used only as a way to represent the output, rather than for supporting the method. The topics produced by the method seem to lack granularity (see comments on Section 4.4).

The evaluation is lacking, since to my understanding it does not prove that the proposed method performs comparably to or better than the state of the art and, critically, seems to focus on evaluating the extraction of DBpedia concepts rather than the clusters. Indeed, since the concepts are linked to the posts with the state-of-the-art TagMe tagger, the novel contribution seems to be the creation of clusters of concepts. However, the evaluation in Section 6 simply demonstrates that specific DBpedia concepts are relevant to the posts they were tagged to, proving mainly the effectiveness of the tagger. In addition, some arbitrary decisions (such as considering a time interval of 2 minutes) are a bit perplexing to me.

In the following, I will refer to specific portions of the paper.

3 Topico Ontology

I appreciate the new resource, but its utility for the proposed method is not clear to me. In which way does the Topico ontology support the technique used by S-BOUN-TI to find the topics? In my understanding, it is only used for representing the outcome of the algorithm. If this is the case, I would move the Topico Ontology section after the Topic Identification section.

4 Topic Identification

“To the best of our knowledge, this is the first time that such an approach is proposed.”
While I am unaware of similar approaches in the specific scenario of micropost analysis, there are some approaches that apply similar techniques (and were not included in the related work section). In particular, I would suggest considering:
1) Augur (Salatino, A., Osborne, F. and Motta, E., 2018. AUGUR: Forecasting the Emergence of New Research Topics. In JCDL'18: The 18th ACM/IEEE Joint Conference on Digital Libraries. ACM, New York, NY, USA.), which presents strong similarities with your technique. Similarly to your method, it 1) tags documents (in that case research papers) with semantic concepts, 2) builds their co-occurrence network, 3) applies community detection techniques (Advanced Clique Percolation Method) to extract clusters of concepts that are considered as topics, and 4) applies the Jaccard index for merging/cleaning the resulting clusters.
2) Topic Sprinkled LDA (Hingmire, S. and Chakraborti, S., 2014. Sprinkling topics for weakly supervised text classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 55-60).) Similarly to your method, it tags documents with semantic concepts from DBpedia and uses this representation to enrich LDA.

4.4. Identify Topics

As previously mentioned, it seems that the added value of S-BOUN-TI is in the clustering step. Therefore, my intuition is that many cliques of size 2-3, while being formally correct, would not be granular enough to be of interest and to show the value of the approach. For example, many of the topics produced for the Trump-Clinton debate simply contain one or two additional concepts in addition to "Trump" and "Clinton". I would appreciate some more examples of the kinds of topics extracted and their granularity.

6 Experiments and results

I appreciate that the authors made the results of the experiments publicly available.

It is not clear to me why the authors decided to split the tweets into intervals of two minutes. What is the purpose of this arbitrary value? What are the advantages of splitting the tweets into these time intervals? Are the sets of topics found in subsequent periods very different?

As anticipated in the first part of this review, the evaluation seems to assess the tagger more than the proposed method and does not compare the results with any state-of-the-art baseline. Indeed, considering specific DBpedia concepts in Section 6.3, rather than the clusters, does not allow evaluating the original part of the proposed approach.

The paper mentions that "The comparison of the results of S-BOUN-TI with other methods is quite challenging since both the method and the topics they generate are significantly different." While I agree that it is not straightforward to set up a valid evaluation, I can see some options that would allow obtaining more significant comparisons than the ones reported in the paper. A first possibility would be using a metric for evaluating clusters, such as the Rand index, and comparing different methods against a manually created gold standard. A second option would be to conduct a user study with the aim of proving that your topic representation is perceived as more useful and understandable than current alternative approaches.
At the moment, the paper simply analyses the similarity of the results with those of 1) an older version of the method and 2) LDA.
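For reference, the Rand index suggested above measures the fraction of element pairs on which two clusterings agree (both together or both apart); a minimal sketch, with hypothetical cluster labels, follows:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of element pairs on which two clusterings
    agree (paired together in both, or separated in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical gold-standard vs. method clustering over six entities.
gold = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
score = rand_index(gold, pred)  # 11 of 15 pairs agree
```

Applied to clusters of DBpedia concepts against a manually created gold standard, such a score would make S-BOUN-TI directly comparable to baseline clustering methods.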

In short, my understanding is that the experiments reported in the paper fail to prove that S-BOUN-TI performs equally to or better than state-of-the-art methods, or that it produces a more comprehensible representation of the topics. I believe this may actually be the case, but it should be proved with a formal evaluation.

8 Related Work

I would suggest moving this section earlier, since it is important for the understanding of the domain.