The Epistemology of Fine-Grained News Classification

Tracking #: 3659-4873

Authors: 
Enrico Motta
Enrico Daga
Aldo Gangemi
Maia Lunde Gjelsvik
Francesco Osborne
Angelo Salatino

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
Abstract: 
The process of news digitalization over the past decades has released massive amounts of news content, revolutionizing consumer access to news and disrupting traditional business models. These radical changes have also introduced new opportunities for media content analysis, potentially opening up new scenarios for ambitious large-scale media analytics initiatives, which can go well beyond the relatively small-scale studies currently carried out by media scholars and practitioners. However, take-up of computational methods to support media content analysis activities has been rather modest, reflecting a degree of disconnect between the needs of scholars and practitioners for task-specific and usable software solutions and the state of the art in computational techniques for news media analysis. In this paper we perform an initial step towards bridging this gap, by looking in detail at the task of fine-grained news classification. In particular, we propose a typology of news topics, which is formally specified and realised into a family of reusable ontologies. The proposed model has been validated empirically, through an analysis of a multilingual news corpus, as well as formally, in terms of the functional and logical properties of the ontologies. Our analysis brings together the media and computer science literature, connecting the formal definitions provided in this paper to the concepts used by media scholars.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 12/Jun/2024
Suggestion:
Accept
Review Comment:

*Paper*

The authors draw from a broad selection of literature to develop the framework. Their focus on describing all topics, situations, and agents in an article - rather than just the top level - is important for capturing not just that an event occurred (or didn't), but also parties involved and the significance of that event or topic in its situation or context.

The validation of the framework seems promising. It would be interesting to see how the framework performs on coverage of a specific event across multiple publications and media, such as coverage of a presidential debate from MSNBC, NYT, and the BBC. Additionally, could a news article be roughly reconstructed based on the data captured using this framework?

The paper itself is well written, well structured, and easily understandable. It would have been excellent to see some kind of diagram or visual representation of the ontology, but otherwise it is a great piece of academic literature.

*Data*

While it lacks a README per se, the data file is straightforward and easily understood.

The authors include mention of their annotated corpus and include an in-text link to the corpus in the paper. For the sake of reproducibility it might make sense for that link to be more prominent and/or perhaps provide a link to it in the description of the ontology.

Overall the paper and included resources are interesting, useful, and well-written.

Review #2
Anonymous submitted on 02/Aug/2024
Suggestion:
Minor Revision
Review Comment:

The paper introduces NCO, a complete ontology for news classification, made of 50 classes and 1622 axioms.
To realise the ontology, the authors started from an extensive study of the different approaches, then defined a conceptual framework, which is in turn formalised in OWL.

The work brings together several aspects that have been addressed in the literature by different studies but never systematically harmonised in a single framework. The resulting ontology may indeed be a good solution for a number of application domains.
The formal discussion is solid, and the evaluation made on top of the ncoex is convincing.

All the data are properly published on the web, and a GitHub repository ensures discoverability and perpetuation. The presence of proper examples is a good added value.

Some minor points which need to be addressed before publication:
* Section 3.2.1. The example for the events that we "believe to have happened" seems not to be on point. The Donation of Constantine, novels, movies, etc. do not have the characteristics of news (at least in a contemporary sense).
* Section 4.1. The claim that "Donald Trump's financial status" is a subtopic of "financial status" is questionable. While the topic is certainly of interest to people searching for news about Trump, I am not sure that those searching for financial status would look for a specific person's status rather than a more global situation. Of course, this is a matter of the importance of the specific over the global (which can lead to an even broader discussion). Indeed, the authors decided to use two different strategies for "DT financial status", which is on one side a "subtopic" of financial status and on the other an "aspect" of Donald Trump. These two strategies could be better justified.
* Section 5. You should maybe include a paragraph in which you discuss if your ontology is published according to the FAIR principles or not.
* Section 5.3.2. I don't understand why some properties include a verb (hasElement, hasAspect, hasTopic) while others do not (topicRole). You may want to harmonise this.
* Section 5.5. Which properties are used for aligning NCO and D0? Is the "equivalent" an owl:sameAs?
* Section 6. How many people annotated the corpus? Is there any agreement measure you can report? Any particular comment on how easy it is to use the ontology (any ambiguous class/property)? Are the Norwegian news items written in English or in Norwegian?
* I am not sure if the IPTC terms should be part of the ontology. The authors may wish to discuss in more detail why this solution is preferable to a separate vocabulary (e.g. in SKOS).
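
As an illustration of the agreement measure asked about in the Section 6 point, the following is a minimal sketch of Cohen's kappa for two annotators. It is one common choice of measure, not something the paper reports. The function name `cohen_kappa` and the labels are hypothetical and are not taken from the paper's corpus or ontology.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for four news items (assumed for illustration only).
annotator_1 = ["Event", "Situation", "Event", "Agent"]
annotator_2 = ["Event", "Situation", "Agent", "Agent"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

With more than two annotators, a measure such as Fleiss' kappa or Krippendorff's alpha would be needed instead.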

Minor/typos
* Section 5.3.1. nco:hasElementOf ("Of" should be removed)
* Table 1 should not be split across two pages
* Ontology page online: the link to the source is not working

Finally, I would like to point out some related work mentioning negative (not happened) events, composed events, situations (conditions), etc. https://aclanthology.org/W14-0702.pdf https://doi.org/10.1007/978-3-031-17105-5_9 (even if the focus of this work is news rather than events)

Review #3
Anonymous submitted on 06/Aug/2024
Suggestion:
Major Revision
Review Comment:

The paper presents a new approach to fine grained news classification underpinned by an ontology of news topics.

I agree with the authors that "topic modelling" is too non-deterministic in its outputs, while those outputs are also often difficult to interpret.
Rather, a predefined set of topics may be determined to which news articles can be associated. Here, the choice of those topics is significant as it constrains the subsequent annotation of the documents, and hence it is particularly difficult to come up with an annotation schema that can adequately cover broad use cases. In this respect, my first consideration was also how to justify any other classification scheme as better than the IPTC NewsCodes. I am not sure I agree it is "coarse grained" with 1,300 concepts. However, the authors go on to discuss concept detection within news content, which is indeed a finer form of annotation than classification. Maybe referring to the work as "news classification" is misleading in this sense, as the term is IMHO used to refer to the allocation of data items to classes, and in this regard IPTC already does offer a broad classification schema. Since the authors move on in Chapter 3 to discuss annotation of news articles with "entities, events, situations.." it seems to me their goal is rather a structured and semantic description of news content and not just a classification (closest to that being 'categorical topics'). This should be clearer from the beginning of the paper (that classification per se, e.g. IPTC, is not expressive enough).

Regarding events, connections between events may be even more significant than collections of events. For example, the spatio-temporal relations between a number of subevents make up a larger event. This seems similar to your "dependent events", except that you restrict relations to causality? Since your notion of situations also appears to depend on causality (the event's occurrence has created these situations), I feel the distinction between them needs to be made clearer.

Having defined the various aspects of a news classification framework, the authors turn to their formal modelling. This appears as a regular triple-based knowledge model. I would have preferred to understand the ontology first (the set of classes and relations fundamental to the framework), as here in Chapter 4 a relation like "hasBusinessConnection" simply appears in an example without it being clear whether this is part of your news classification framework's vocabulary. A graph illustration of the core classes and relations of your ontology at the beginning would greatly help (another example of this is in 4.5.1., where the relation "isEffectiveAgainst" is used but does not seem to be a core part of the class 'Claim').

Most of the presented modelling is not new or innovative, indeed existing ontologies are (rightfully) reused where they already model the desired aspect. As such, I am not sure if so much space is needed for Chapter 4. Then Chapter 5 largely focuses on the representation of this framework in OWL. An evaluation is focused on the logical and structural consistency.

In my view, the value of developing this framework (and its representation in RDF/OWL) needs to be shown in its usage (creation of semantic descriptions of news items, and how it compares to the state of the art) and in how those descriptions (conforming to the ontology) evaluate "better" than existing classifications (e.g. more expressiveness). In Chapter 6, how many annotators manually classified the 224 news articles? Was annotator agreement tested? I would shorten the previous chapters and provide a much more comprehensive insight into the evaluation results. The results of this manual annotation are presented but I miss a comparative reference -- how can I know this is better or worse than using some other framework? For example, Chapter 7 begins by presenting this to address a "comprehensive model of what types of concepts provide the main subject matter for news items" - so why not compare this framework with other works to capture the meaning of news articles in terms of coverage and completeness? Of course, I appreciate we may be lacking a "ground truth" here of what would constitute the most desirable model.
This would however offer a more holistic view of the work being presented (the "classification framework") and how it compares. Chapter 7 still considers contributions separated into the different aspects. A table at the end would be helpful to highlight in which aspects your work is innovative and which other parts clearly reuse the state of the art.

Section 7.4 states truthfully "our model follows standard practice...", and this is for me the major problem with this as a journal paper: highlighting the actual contribution of the work that goes beyond the state of the art. For me, the "limitations of the current solutions" (Chapter 8) are not adequately highlighted, and less so how your work addresses them. It may be an issue of better structuring the paper, and spending less time on the formalization/ontology modelling and more on evaluating and discussing results (a news article described using this framework - how does it compare to SoA models, how is it "better"?). You mention "the needs of media scholars and practitioners", but have these been formalized in any way through a focus group or expert interviews? This could act as a baseline for evaluation.

In conclusion, you also write this is "the first step" and I feel it is indeed a good step forward but possibly not far enough for a journal publication. More empirical evaluation is needed to justify the ontological decisions made.