Focused Categorization Power of Ontologies: General Framework and Study on Simple Existential Concept Expressions

Tracking #: 3008-4222

Vojtěch Svátek
Ondřej Zamazal
Viet Bach Nguyen
Miroslav Vacura
Jiří Ivánek
Ján Kluka

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
When reusing existing ontologies for publishing a dataset in RDF or developing a new ontology, preference may be given to those providing extensive subcategorization for the classes deemed important in the new dataset schema or ontology (focus classes). The reused set of categories may consist not only of named classes but also of compound concept expressions viewed as meaningful categories by the knowledge engineer, possibly later transformed into named classes in a local setting. We define the general notion of focused categorization power of a given ontology, with respect to a focus class and a concept expression language, as the (estimated) weighted count of the categories that can be built from the ontology’s signature, conform to the language, and are subsumed by the focus class. For the sake of tractable experiments we then formulate a restricted class expression language based on existential restrictions, and heuristically map it to syntactic patterns over ontology axioms (so-called FCE patterns). The characteristics of the chosen concept expression language and the associated FCE patterns are investigated using three different empirical sources derived from ontology collections: first, the frequency of class expression patterns in class definitions; second, the occurrence of FCE patterns (mapped onto the class expression patterns) in the Tbox of ontologies; and last, the ‘meaningfulness’ of two different samples of class expressions generated from the Tbox of ontologies (through the FCE patterns), as assessed by different groups of users, yielding a ‘quality ordering’ of the concept expression patterns. The different types of complementary analyses and experiments are then compared and summarized. Aside from the various quantitative findings, we also offer qualitative insights into the meaning of both explicit and implicit compound concept expressions appearing in the semantic web realm.
To allow for further experimentation, a web-based prototype (named OReCaP) was also implemented, covering the whole process of ontology reuse from keyword-based ontology search through the actual FCP computation to the selection of ontologies to be reused and their enrichment with new named concepts defined through the chosen compound expressions.

Major Revision

Solicited Reviews:
Review #1
By Luigi Asprino submitted on 29/Mar/2022
Review Comment:

The authors have much improved the quality and clarity of the paper and addressed all the issues I raised in my former review.
Specifically: 1) they included a deeper and more accurate description of how the weight function works and substantiated the description with concrete examples; 2) they clarified the role of the FC patterns; 3) they motivated throughout the paper the proposed criteria for assessing the quality of the generated categories.
In my opinion the paper can be accepted for publication as is.

Review #2
Anonymous submitted on 29/Mar/2022
Major Revision
Review Comment:

This paper is a significantly updated resubmission of submission #2406-3620.

The cover letter gives the following points as the main changes:

> - completely reworked the formal apparatus, which is now better aligned with the terminology and apparatus used by the description logics community
> - added the pseudo-code of the algorithm realizing the proposed approach
> - added more discussion on limitations and open questions of the approach
> - added the comparison with inductive concept learning, in the Related Research section
> - improved textual passages that had not been well readable, and moved parts of the text around to make the narrative smoother.

In addition, a GUI prototype is provided.

I consider the manuscript an improvement over the earlier work, and significant effort has been invested. Before publication, I think the points (1) - (7) below have to be improved.

In very short summary, I think the paper

(*) has invested clear and high effort in a relevant aspect of ontologies

(*) undersells this through a sometimes overly complex exposition (as discussed in detail below) - the paper could convey its contribution, novelty and impact more clearly in a shorter exposition

The manuscript is indeed a significant improvement over the earlier submission. I do think that additional significant effort is needed to bring it up to the quality of the journal, but consider it a worthwhile endeavour given the amount of good work invested in this project.

(1) General readability.

The manuscript, while significantly improved, is not yet at the level of readability and polish required for a journal publication. I want to explicitly commend the authors' effort in improving this, but remark that a number of items are still not of the quality required. In particular this includes the general clarity of the narrative (the intended meaning behind many paragraphs is at times vague and could be made crisper), colloquial use of punctuation symbols (e.g., brackets opening paragraphs, symbols within words), etc. There are many paragraphs that, to the best of my understanding comparing both versions, could still be improved in that respect. There are some new paragraphs (e.g., the "pizza" example from the introduction) that are not clear without context. I do add that more examples are definitely good for the accessibility of this paper.

(2) More "upfront" narrative and examples.

This also relates to some comments from earlier reviews (Reviewer #2 from the earlier version stating "The paper would be inaccessible to the general audience. This is supposed to be a journal publication, but I don’t think that an ordinary PhD student or a young PostDoc would be able to learn much from this paper.") I think this has significantly improved, but still needs a revision to make sure that notions and goals are introduced "upfront". Let me give an example: Compound concept expressions and their corresponding description logics suddenly appear in the (nice) new "working assumptions" subsection. However, for a reader, it is at a point where it is still hard to understand the context.

(3) Clarity of concepts.

The introduction is quite vague about its concepts (making it hard for the reader to grasp what it aims for), while the formal definitions are relatively abstract, with very little connection between the two. The examples, while very nice, are not enough to fully convey this. I think there are multiple solutions for this (also depending on the opinion of the other reviewers on this). One of them is

(3.1) Extend the introduction by introducing the concepts one by one, including examples, so that a reader gets an understanding of them, then follow with the formal definitions in Section 2 as is.

(3.2) Make the introduction more high-level, and provide Section 2 in a more accessible way, interlacing concepts, intuitions and examples.

I think (3.1) would work better - but any version of the paper that makes clearer the combination of concepts, their intuitions and examples would help here.

(4) Some notions and intentions unclear.

Some notions are used before it is clear what they are aiming for, e.g., "heuristic linking" is used without context in the introduction, and then referenced in the formal definition section. In the introduction that is to a certain extent ok, but in the formal definitions section, a reader does require some context. With regard to intentions, in some parts of the paper (to give one example: Section 2.5) a lot of effort goes into conveying some very particular computation, but it is unclear what the intention is, the meaning of the specific choices, etc. Making the intentions clearer could make such sections shorter and crisper (see also points (2) and (3) above).

(5) Some sections need a better "roadmap".

I give as an example Section 2, which Section 3 later refers to as "[having defined the framework]": it has a good initial summary, but then loses the reader in its subsections, where interesting topics are discussed but the overall structure becomes unclear. Perhaps a more elaborate introduction to the section could help here, or a clearer statement for each subsection of (*) what has been shown so far, (*) what needs to be discussed next, and (*) how the two connect. Perhaps both would make it a more convincing section.

(6) Some sections need a clearer contribution.

I commend Sections 4 and 5 for the good examples. I think they should stay if at all possible within the space - they are great. What these sections do not convey in enough detail is the contribution they make to the overall scientific contribution of the paper. The results explained in Section 5.3 need to be better connected to the framework, and the scientific contribution and its novelty highlighted more.

(7) Experiments need more explanation of impact.

The cognitive experiments are very interesting and a valuable addition to the field, as are the others. At the moment, I believe the impact of the results is not described enough, especially given how much effort was involved in performing, in particular, the cognitive experiments. I think highlighting more of the impact would make the effort more valued.

Regarding the provided code repository, I consider it good - with some more additions to the readme possible to make it even easier for users to immediately understand the importance of all parts. I find the code and software provided to be a very good contribution.

Review #3
By Dörthe Arndt submitted on 05/May/2022
Major Revision
Review Comment:

The paper suggests a method to evaluate the suitability of existing ontologies for new use cases. The authors define a measurement that allows the interested user to calculate a value, the so-called Focused Categorization Power (FCP), which estimates for a focus class of interest how well this class can be split into subclasses and concepts for further use. The concept of the FCP is first introduced in a general way to be further specified later for an ontology language which only allows existential quantification. The authors justify their choice with different experiments and explain how the FCP can be calculated in practice.

The paper is in general well written and understandable and the concept of an FCP is interesting. However, I see some problems with the work as it is:

1. The authors define the FCP as a concept over DL-ontology concepts (using OWL-DL) and then calculate an approximation of that value, \^{FCP}, which only depends on RDF concepts. They argue that they expect users to construct OWL-DL classes from RDF constructs. Since OWL-DL and RDF are not fully compatible (think for example of blank nodes), such a combination will result in an OWL-Full ontology on which it is difficult to do reasoning. So, I wonder how realistic this expected use is. For which purposes will users construct an OWL-Full ontology? For querying? I would ask the authors to clarify that point.

2. Related to point 1: The notation suggests that \^{FCP} is an approximation of FCP. I have difficulties seeing that relationship. Since FCP is calculated on OWL-DL ontologies and \^{FCP} on RDF ontologies, these two definitions seem to be rather loosely coupled. Is it possible to show that if we construct an OWL-DL ontology from an RDF ontology using the classes as specified, that the values for FCP and \^{FCP} are actually close to each other?

3. In general, I think that syntax and semantics are often mixed up: Some concepts like for example the FCE patterns rely on syntactical concepts which are sometimes a result of selective reasoning. How do we perform such partial reasoning? Doesn’t the fact that we rely on the syntax as stated by the data modeller (as opposed to a reasoning result) rather mean that the authors make assumptions of the modelling behaviour? If that is the case, I would ask the authors to make these assumptions explicit and (if possible) justify them (e.g. by citing studies or sharing own experiences).

More detailed comments/concerns/questions:

- OWL and RDF: Note that OWL-DL and RDF are not fully compatible (think for example of the meaning of blank nodes). So of course one can construct OWL classes from the RDF ontologies in LOV, but the resulting ontologies will most likely be OWL-Full ontologies, which makes it difficult to use them for reasoning. These are still interesting for querying applications (which seem to be the most relevant use case the authors envision), but some FCP calculations rely on reasoning, and here we could get problems. The same holds for SKOS ontologies, but there the authors are aware of the problem.

- Mix of syntax and semantics: The authors for example define categories which can be represented using different syntactic constructs and then claim that to calculate the FCP they take the simplest representation into account (see page 7). They justify this by saying that easier constructs are more likely to be reused by users, but in fact they cannot judge whether a construct in an ontology is easy if they group constructs into categories (as an example, think of owl:someValuesFrom vs. owl:minCardinality 1 as two ways to express the same concept). If they argue with the simplicity of patterns, they would have to keep these patterns instead of grouping them into classes.
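To illustrate this point, here is a minimal hypothetical sketch (my own illustration, not the paper's algorithm; the `ex:` identifiers and the `category` helper are invented for the example): two syntactically different OWL surface patterns denoting the same existential concept are mapped to one category, which is exactly where the information about which surface form the modeller chose gets lost.

```python
# Surface patterns as (constructor, property, filler) triples, e.g. as one
# might extract them from an ontology serialization. Both express
# "∃ hasTopping.CheeseTopping", but with different OWL constructs.
expr_a = ("owl:someValuesFrom", "ex:hasTopping", "ex:CheeseTopping")
expr_b = ("owl:minCardinality=1", "ex:hasTopping", "ex:CheeseTopping")

def category(expr):
    """Map semantically equivalent surface patterns to a single category
    (hypothetical normalizer; discards the original constructor)."""
    constructor, prop, filler = expr
    if constructor in ("owl:someValuesFrom", "owl:minCardinality=1"):
        return ("EXISTENTIAL", prop, filler)
    return ("OTHER", prop, filler)

# Once grouped, one can no longer tell which (simpler or more complex)
# syntactic construct the modeller actually used:
assert category(expr_a) == category(expr_b)
```

The assertion passing is precisely the reviewer's objection: after grouping, a simplicity-based argument can no longer distinguish the two constructs.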

- Subjective aspects of the modelling: There are different places in the paper where the authors suggest that we can derive what the creator of an ontology considered meaningful, or how certain concepts are “meant”, from the way they are put in the ontology (e.g. they say on page 9 that if there is a domain declaration for a concept but no axiom stating that the predicate is defined for all instances of the class, then we can assume that there exist instances of the class for which no relation is defined). I consider these assumptions very strong and would encourage the authors to either make them more specific (if they are their own assumptions) or to add some references showing that data modellers behave that way (if that has been proven before). I would expect people's opinions to differ when it comes to “good” modelling.

- Relation between FCP and its approximation: The formulas we have and the different approximations still look rather random to me, in the sense that I cannot see a concrete FCP value (formula 1) which can be approximated by an \^{FCP} value (formula 3). In my opinion the authors give different ideas on how to calculate heuristics to evaluate an ontology for possible reuse, but these are not as strongly connected as the naming in the formulas suggests.

- In general: less text, more definitions/formulas. The formulas somehow get lost in the text. For example, it took me some time to see the connection between a category (Cat) and the concrete set of categories FC ∩ D. This could be improved by being more concise when writing the definitions and shortening the text.

- Calculation of FCP: Do the authors plan to add some kind of normalisation? How do I, for example, know whether the number 24.5 from the example (page 15) is high or not?

- Definition of p3 (Formula 9): I think it is problematic that the definition depends on reasoning for some of the concepts and not for others (more specifically: they only consider the range declaration when it is asserted). Apart from the fact that this might form a burden in practice - how do we perform such partial reasoning? - it also mixes up syntax and semantics. What happens if we have a data modeller who explicitly declares that every class is a subclass of owl:Thing? In my opinion, the fact that a data modeller tries to be complete does not change the quality of the ontology. Such details should be irrelevant.
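One way to make a syntax-based count robust against this kind of stylistic completeness would be to drop semantically vacuous axioms before counting. A minimal hypothetical sketch (my own illustration, not the paper's definition of p3; the `ex:` names and the `non_trivial_axioms` helper are invented):

```python
OWL_THING = "owl:Thing"

def non_trivial_axioms(axioms):
    """Filter out tautological subclass assertions from (s, p, o) triples,
    so that an explicit 'C rdfs:subClassOf owl:Thing' does not affect any
    downstream syntactic count."""
    return [
        (s, p, o) for (s, p, o) in axioms
        if not (p == "rdfs:subClassOf" and o == OWL_THING)
    ]

axioms = [
    ("ex:Pizza", "rdfs:subClassOf", "owl:Thing"),      # vacuous
    ("ex:Margherita", "rdfs:subClassOf", "ex:Pizza"),  # informative
]
assert non_trivial_axioms(axioms) == [
    ("ex:Margherita", "rdfs:subClassOf", "ex:Pizza")
]
```

With such a normalisation step, two ontologies differing only in whether the modeller spelled out subclass-of-Thing axioms would yield the same counts, addressing the "such details should be irrelevant" concern.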

- Universal Restriction, page 20: normally OWL-DL statements are of a declarative nature. That is, if the ontology contains a universal statement, the data cannot *contradict* that statement (at least not without negation). It is clear what the authors mean (they want to suggest restrictions based on the data), but they might want to change the wording here.

- Number of sections/ length of the paper: the paper covers many different aspects of assessing ontologies for re-use and is therefore very long. I understand that the topic is complex, but I would still recommend to restructure (11 sections are a lot) and to shorten the text. That will make it easier for the reader to put the different sections in relation.

Typos/small comments

- Page 8: equation 3 is referenced before it is introduced, that is difficult for the reader

- Page 9: “In a CEL For the …”

- Page 20: “instead of allowing a names class…” -> named class

- I’d suggest adding some more explanation to the captions of tables and figures; for example, page 30, Table 8 -> the caption would be better if it said what can be concluded from the table

- Page 35: “several cognitive studies…” -> (almost) same sentence twice

- RHS: there are many places where you mention the RHS of equivalence declarations. Please clarify what exactly you mean here. For me that is not clear because equivalences are symmetric.