CANARD: An Approach for Generating Expressive Correspondences based on Alignment Need and ABox-based Relation Discovery

Tracking #: 3364-4578

Authors: 
Elodie Thieblin
Guilherme Sousa
Ollivier Haemmerlé
Cassia Trojahn dos Santos

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Abstract: 
Ontology matching aims at making ontologies interoperable. While the field has matured in recent years, most approaches are still limited to the generation of simple correspondences. More expressiveness is, however, required to better address the different kinds of ontology heterogeneities. This paper presents CANARD (Complex Alignment Need and A-box based Relation Discovery), an approach for generating expressive correspondences that relies on the notion of competency questions for alignment (CQA). A CQA expresses the user's knowledge needs in terms of alignment and aims at reducing the alignment space. The approach takes as input a set of CQAs as SPARQL queries over the source ontology. The generation of correspondences is performed by matching the subgraph from the source CQA to the similar surroundings of the instances from the target ontology. Evaluation is carried out on both synthetic and real-world datasets. The impact of several approach parameters is discussed.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 22/Mar/2023
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The paper tackles the challenge of identifying complex mappings across ontologies and proposes CANARD, an approach based on the description of requirements in unary or binary SPARQL queries as input, so-called Competency Questions for Alignment (CQAs). CANARD distinguishes itself as the first using CQAs to reduce the search space, thus capable of handling large ontologies and not relying on any complex mapping patterns. The work has been presented as an ISWC 2020 paper previously, and the current journal version explicitly lists the extensions. This is the second round review, and I didn’t participate in the first round. After reading the first round comments, the authors’ responses, the current updated version, and the ISWC 2020 version, I eventually believe that the paper presents a unique way, with a fair group of extensions from its conference publication, the evaluation is comprehensive, and the analysis is of depth showing both the advantages and limitations, and thus would like to recommend Minor Revision.

There are still some issues, small or big, that should be addressed before the paper becomes ready for a journal publication. Let me list them as follows.

TITLE. The title of the paper needs to be reworked. Firstly, “alignment need” is seldom used in the paper, whereas “user’s need” or “user knowledge need” occurs a lot. Secondly, “ABox-based relation discovery” is seldom used in the paper either, and what is the relationship between “alignment need” and “ABox-based relation discovery”? As a matter of fact, it seems to me that the involvement of a user is not necessary in the approach, as it is not interactive mapping anyway. Both “user” and “user knowledge need” are ambiguous in the paper. I suggest renaming the paper to something like “CANARD: An Approach for Generating Expressive Correspondences based on Competency Questions for Alignment”, as this is exactly what the approach is. And when introducing the problem or discussing the application, you can mention WHO can give these CQAs: those who intend to query the source ontology (for instance, an ontology-based application user), those who link the source ontology to the target one (for instance, an ontology engineer), or others. I notice that “user” is mentioned frequently throughout the paper, so removing it may take some work.

ABSTRACT. The abstract is incomplete as nothing has been said about the results of the evaluation, nor any conclusions about the approach.

INTRODUCTION. Before listing the extensions based on the ISWC 2020 conference version, an explicit statement of the contributions of the paper is needed.

MAIN STEPS. In Section 4, each main step of the approach is described. Some of them are approximate and carry uncertainty, like 4.4 Label Similarity, and the others might be sound and complete, like 4.1 Translating CQAs into DL Formulae. This should be stated explicitly, as the former need empirical evaluation and the latter theoretical proof.

QUALITATIVE ANALYSIS. I agree with a comment from the previous review round that a qualitative analysis of the resulting mappings is needed to complement the quantitative evaluation. One way to do this is to add, in Section 5.6 Comparison on the OAEI Systems, a table of (some of) the complex mappings identified by CANARD but missed by the other systems. This would demonstrate the power of the proposed approach convincingly.

SIMPLE VS COMPLEX. As both simple and complex mappings are simultaneously identified in CANARD, situations like one-to-one correspondences misjudged as complex ones and complex correspondences misjudged as one-to-one ones should all be discussed. In real-world ontology applications, being good at identifying simple mappings is as important as identifying complex ones.

MORE THAN TWO ONTOLOGIES. When introducing CQAs, “two or more ontologies” occurs a couple of times in the paper. If CANARD is capable of matching more than two ontologies in a nontrivial way, please elaborate on this matter; otherwise remove “or more”.

LIMITATION. I don’t think the requirement that the user be familiar with SPARQL and the source ontology is a serious limitation of the approach. This is a problem concerning the application of the approach. Fundamentally, being constrained by the expressivity of SPARQL limits the complex matches that can be found. This should be discussed in depth in the paper, together with possible extensions as future work.

RELATED WORK. Matching methods capable of identifying both simple and complex correspondences that do not rely on any questions, patterns, or instances like the following should be mentioned:
Mengyi Zhao, Songmao Zhang, Weizhuo Li and Guowei Chen. Matching biomedical ontologies based on formal concept analysis. Journal of Biomedical Semantics, 2018 Mar 19;9(1):11. doi: 10.1186/s13326-018-0178-9

Lastly, several writing errors:
- Page 2, “a system that discovers expressive correspondences2”: the content of footnote 2 should be moved into the main text, as it's important to know what kind of expressive correspondences can be found by CANARD.
- Page 2, Section 2.1: using “:” to link a definition with an example is not appropriate.
- Page 28, “with a (the detailed … in Section 5.3.”: the phrase is incomplete.

Review #2
Anonymous submitted on 06/Apr/2023
Suggestion:
Accept
Review Comment:

The authors have responded convincingly and addressed the various questions and suggestions. The paper is now clearer and the addition to the already published work is better illustrated in terms of improvements in the performance of the system, comparisons within the OAEI campaign, the addition of definitions and examples, and the extension of the state of the art.
Some typos:
is however => is, however,
as for simple alignment => as for the simple alignment
as answer => as an answer
allows to isolate => allows isolating
DL formula similarity similarity => DL formula similarity
the correspondence are not => the correspondences are not
and take approximately 6:18 hours => and takes approximately 6:18 hours

Review #3
Anonymous submitted on 26/Apr/2023
Suggestion:
Minor Revision
Review Comment:

I thank the authors for the several improvements they made on the paper. I find the manuscript much improved.

However, a few aspects still require some attention:

From my previous review:
"While I understand these two tasks do not come with pre-defined CQAs, these results could highlight the reliance of CANARD on manually defined CQAs vs automatically generated ones. In fact, it would be great to see an evaluation for Conference, based on both the high-quality CQAs and automatically generated ones (which were made for CANARD's 2020 OM paper)."

Authors have added a new section showing OAEI results, which is great, but the discussion does not focus at all on the aspects I mentioned in my review. I would like to understand why this was not addressed in the paper nor justified in the letter.

I stated in my previous review: "In Equation 4, the sum of labelSim and structureSim adds up to 1.5, since labelSim is in [0,1] and structureSim is set to 0.5 or 0. Is this correct? I was expecting values of similarity in [0,1]. Why this unusual definition of similarity? And then later "When putting the DL formula in a correspondence, if its similarity score is greater than 1, the correspondence confidence value is set to 1." This means that a Levenshtein similarity of 0.5."

The authors simply answer: This is not an usual combination of similarities. If it score higher than 1.0 it is normalised to 1.0.

I point out that it is not the combination of similarities I find unusual, but rather that it is ranged from 0 to 1.5. In their answer authors state that values above 1.0 are normalised to 1.0, but that is not what is described in the paper. Normalisation is not at all just making every value above 1 go to 1. So I ask again, why have this be 0-1.5 and then cut values at 1.0 instead of being 0-1 and never having to cut-off values greater than 1.0?
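
To make the distinction concrete, here is a minimal sketch of the two options, assuming only what is quoted above (labelSim in [0,1], structureSim equal to 0.5 or 0, and scores above 1 set to 1). The function names and the sample values are hypothetical and are not taken from the CANARD implementation.

```python
# Hypothetical illustration of clipping vs. rescaling a score in [0, 1.5].
# Names and sample values are assumptions, not CANARD's actual code.

STRUCTURE_BONUS = 0.5              # structureSim is 0.5 or 0 (as quoted above)
MAX_RAW = 1.0 + STRUCTURE_BONUS    # largest possible raw score: 1.5


def raw_score(label_sim: float, structure_match: bool) -> float:
    """Sum of labelSim and structureSim, ranging over [0, 1.5]."""
    if not 0.0 <= label_sim <= 1.0:
        raise ValueError("label_sim must lie in [0, 1]")
    return label_sim + (STRUCTURE_BONUS if structure_match else 0.0)


def clipped(score: float) -> float:
    """Cut-off as described in the paper: anything above 1 becomes 1."""
    return min(score, 1.0)


def rescaled(score: float) -> float:
    """Normalisation in the usual sense: map [0, 1.5] linearly onto [0, 1]."""
    return score / MAX_RAW


if __name__ == "__main__":
    for label_sim in (0.5, 0.75, 1.0):
        s = raw_score(label_sim, structure_match=True)
        print(label_sim, s, clipped(s), round(rescaled(s), 3))
```

With a structural match, every label similarity from 0.5 upwards receives the same clipped confidence of 1, whereas linear rescaling keeps these candidates ordered. This is the behaviour the question above is about.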

Minor:
Incomplete sentence, page 28: "with a (the detailed results are described in Section 5.3."