An Unsupervised Approach to Disjointness Learning based on Terminological Cluster Trees

Tracking #: 2233-3446

Claudia d'Amato
Nicola Fanizzi

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper
In the Semantic Web context, regarded as a Web of Data, research efforts have been devoted to improving the quality of the ontologies that are used as vocabularies to enable complex services based on automated reasoning. Various surveys show that many domains would require better ontologies that include non-negligible constraints. In this respect, disjointness axioms are representative of this general problem: these axioms are essential for making the negative knowledge about the domain of interest explicit, yet they are often overlooked during the modeling process (thus affecting the efficacy of the reasoning services). To tackle this problem, automated methods for discovering these axioms can be used as a tool for supporting knowledge engineers in the task of modeling new ontologies or evolving existing ones. The current solutions, whether based on statistical correlations or relying on external corpora, often do not fully exploit the terminology of the knowledge base. Stemming from this consideration, we have been investigating alternative methods to elicit disjointness axioms from existing ontologies based on the induction of terminological cluster trees, which are logic trees in which each node stands for a cluster of individuals that emerges as a sub-concept. The growth of such trees relies on a divide-and-conquer procedure that assigns, to the cluster representing the root node, one of the concept descriptions generated via a refinement operator and selected according to a heuristic based on the minimization of the risk of overlap between the candidate sub-clusters (quantified in terms of the distance between two prototypical individuals). Preliminary works showed some shortcomings that are tackled in this paper.
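The divide-and-conquer procedure described in the abstract can be illustrated with a deliberately simplified sketch. Note the simplifying assumptions: here individuals are plain numeric feature vectors, candidate "concepts" are ordinary boolean predicates, and the distance is Euclidean, whereas the actual method refines description-logic concept descriptions over a knowledge base and uses semantic (dis)similarity measures. The names (`best_split`, `induce_tree`, etc.) are illustrative, not taken from the paper.

```python
# Illustrative sketch of divide-and-conquer cluster-tree induction.
# Assumptions (not from the paper): individuals = numeric feature vectors,
# candidate concepts = boolean predicates, distance = Euclidean.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid(cluster):
    # The element minimizing the total distance to the others:
    # a prototypical individual representing the cluster.
    return min(cluster, key=lambda c: sum(dist(c, o) for o in cluster))

def best_split(individuals, candidates):
    # Pick the candidate concept that maximizes the distance between the
    # medoids of the positive and negative sub-clusters, a proxy for
    # minimizing the risk of overlap between them.
    best, best_score = None, -1.0
    for concept in candidates:
        pos = [i for i in individuals if concept(i)]
        neg = [i for i in individuals if not concept(i)]
        if not pos or not neg:
            continue  # a concept that does not split the cluster is useless
        score = dist(medoid(pos), medoid(neg))
        if score > best_score:
            best, best_score = concept, score
    return best

def induce_tree(individuals, candidates, min_size=2):
    # Recursively split each cluster until no candidate separates it
    # or it is too small; leaves hold the final clusters.
    concept = best_split(individuals, candidates) if len(individuals) >= min_size else None
    if concept is None:
        return {"leaf": individuals}
    pos = [i for i in individuals if concept(i)]
    neg = [i for i in individuals if not concept(i)]
    return {"concept": concept.__name__,
            "pos": induce_tree(pos, candidates, min_size),
            "neg": induce_tree(neg, candidates, min_size)}
```

In the full framework, concepts whose sub-clusters end up in sibling branches become candidates for disjointness axioms; this sketch only captures the splitting heuristic.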
To tackle the task of disjointness axiom discovery, we have extended the terminological cluster tree induction framework with various contributions, which can be summarized as follows: 1) the adoption of different distance measures for clustering the individuals of a knowledge base; 2) the adoption of different heuristics for selecting the most promising concept descriptions; 3) a modified version of the refinement operator to prevent the introduction of inconsistency during the elicitation of the new axioms; 4) the integration of a framework for distributed and efficient in-memory processing, namely Spark, for scaling up the generation of the set of candidate concepts produced by the refinement operator. A wide empirical evaluation showed the feasibility of the proposed extensions and the improvement with respect to alternative approaches.
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Feb/2020
Review Comment:

I was asked to review this new version of the paper, which was adapted according to the reviewer comments from an earlier submission of the paper.
I have carefully studied the new version, particularly w.r.t. my earlier comments about the paper, and am satisfied with the improvements made.
The only remaining question was w.r.t. my request for more information about the ontologies. As discussed w.r.t. earlier comments, I was interested in the relation between your method and the complexity of the ontologies you applied your methods to. Therefore, I had asked for more details about the practical complexity of the ontologies, e.g. what the average size of the axioms was and what types of operators were actually used. But that might not add that much to the paper, I guess.

So, I am happy to recommend to accept the paper in its current form.

Review #2
Anonymous submitted on 12/Mar/2020
Minor Revision
Review Comment:

The new version submitted by the authors has satisfactorily addressed most of the issues raised by the reviewers in the first round. Let me first say that I greatly appreciate the effort that the authors have made to improve the content and tighten the writing. The manuscript has improved a lot and is now well-organized and reads generally well.

Yet, there is one major thing that needs to be fixed, and a number of minor things to fix that I mention below.

The major thing is that I am still not convinced about the SPARK-approach to computing the cluster tree. This part is not strongly related to the other parts of the paper, that is, to the algorithm for constructing TCTs or to the evaluation. Further, I really do not get an intuition for how the computation is sped up. I understand that the specialization procedure is computed in a distributed fashion, but I do not see any details in the paper to understand what exactly is distributed and how one avoids recomputing the same refinements in different threads. To me the solution seems to be a dynamic programming approach rather than a distributed approach. In any case, I am not convinced by this section and think that it should be removed from this paper, both to focus the paper on one contribution and to gain clarity. As it stands, the explanations are too sparse. This should be published either in a separate paper or not at all, as I do not see any particularly strong contribution in this part, which seems quite straightforward.

Now the minor issues:


In the context Semantic Web context -> double mention of context

nonnegligible constraints => unclear what non-negligible means here

Page 2

The effectiveness of the mentioned complex inference services => complex in which sense? Unclear

Sentence starting with "Reasoning under open world-semantics..." is unclear due to many brackets.

Page 3:

Unclear what "The former" refers to in "The former (with a similar structure..."

second column:

Despite... there are some issues *no comma* that were not

further below

indicates a truly erroneous axiom *no comma* or a special case

Page 6, Algorithm 1, CS in Induce(I,C,CS) is not introduced as input

Page 9:

Not clear what the following means: "It is important to avoid the generation of satisfiable concept descriptions for which the training individuals exhibit a neutral membership" What is neutral membership?

Example 4: Let us *s*uppose (lower-case s)

2nd column

to process such kind of data by means a transparent approach => of a transparent approach

Formula 3 on page 10:

Why is F not part of the index of $\pi$ ?

Page 11

Below formula 6, the bracketing for \pi(a) is wrong. It should be \pi_{(a)} but the authors seem to have written \pi_(a) in latex.

2nd column

for gathering concepts descriptions => gathering concept descriptions

Page 13

The distance measure .. was selected from the family *no comma* with a context of features

In all cases but the first release ... this sentence seems grammatically odd

Page 14

For eliciting the target axioms... this sentence reads oddly

Page 16

instances of C \cap D exceed.... full stop is in next line.

This could yield to limit => awkward formulation

Page 17


In the experiments with GEOSKILLS both all methods => both all sounds weird.

Page 18

in the experiment on the original KBS => KBs

This depended on the complexity *no comma* in terms of syntactic length of the ... odd sentence

=> This depended on the syntactic length of the concept description generated and the threshold on the number...

Page 20 Conclusion

between the farthest elements of a cluster w.r.t. the medoid the other cluster resulting from => grammatically odd sentence

refinement operator with the one used in he previous => in the previous version