Enhancing Ontology Matching: Lexically and Syntactically Standardizing Ontologies Through Customized Lexical Analyzers

Tracking #: 3649-4863

Authors: 
Jomar Silva
Kate Revoredo
Fernanda Araujo Baião
Cabral Lima
Jérôme Euzenat

Responsible editor: 
Guest Editors OM-ML 2024

Submission type: 
Full Paper

Abstract: 
Ontology matching systems commonly leverage linguistic metrics to establish mappings between entities of the ontologies being aligned. However, because entity names are not standardized across these ontologies, such metrics may cause some correct mappings not to be selected. Existing methodologies that standardize entity names often do so without considering the ongoing matching task, potentially producing inaccurate outcomes. These tools are also generally unconcerned with the syntactic standardization of entity names. To address this issue, we introduce a novel approach that standardizes entity names both lexically and syntactically through the development of a customized lexical analyzer tailored to the aligned ontologies. We evaluate the efficacy of this approach using ALIN, an interactive ontology matching system, along with the human and mouse ontologies from the Anatomy track of the OAEI. Our findings demonstrate an improvement in the quality of the alignment results.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ondřej Zamazal submitted on 20/May/2024
Suggestion:
Major Revision
Review Comment:

Generally, the idea of a separate lexical and syntactic phase before applying ontology matchers is appealing. Usually, this step is hardcoded in the ontology matching systems. It could be done either in general or with the given ontologies in mind. Furthermore, the lexical and syntactic steps could also directly consider the other ontology to be matched. That already amounts to a kind of ontology matching, but all of these options could help ontology matching tools in their matching activity, where more than the lexical aspect is involved.
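
To make the idea concrete, the following is a minimal, hypothetical sketch of what such a separate normalization phase could look like when applied to entity names before any matcher runs; the class name and the specific rules (lowercasing, splitting underscores and camelCase, sorting tokens) are assumptions for illustration, not the analyzer proposed in the paper.

    import java.util.Arrays;
    import java.util.Locale;
    import java.util.stream.Collectors;

    // Hypothetical sketch of a standalone normalization phase; the rules
    // below are illustrative assumptions, not the paper's analyzer.
    public final class NamePreprocessor {

        // Splits camelCase, underscores, and hyphens into space-separated,
        // lowercased tokens.
        static String tokenize(String entityName) {
            return entityName
                    .replaceAll("([a-z])([A-Z])", "$1 $2")
                    .replaceAll("[_\\-]+", " ")
                    .toLowerCase(Locale.ROOT)
                    .trim();
        }

        // Canonical form: tokens sorted alphabetically, so word order is
        // ignored when names are later compared by a matcher.
        static String canonicalize(String entityName) {
            return Arrays.stream(tokenize(entityName).split("\\s+"))
                    .sorted()
                    .collect(Collectors.joining(" "));
        }

        public static void main(String[] args) {
            // Both spellings normalize to the same canonical form.
            System.out.println(canonicalize("Urinary_Bladder")); // bladder urinary
            System.out.println(canonicalize("bladder urinary")); // bladder urinary
        }
    }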

The paper proposes a lexical analyzer that considers the ontologies and the matching activity. My concern is that the paper is mainly about the lexical analyzer approach (Sections 1, 2, and 3), yet it is evaluated only within the ALIN matching system. This is inconsistent: the paper should either be about the lexical analyzer, with a general evaluation across several matching systems, or be rewritten to be about the ALIN matching system from the beginning. As it stands, the experimental setting is not appropriate; it does not evaluate the idea of a lexical analyzer, but rather one matching system that applies lexical preprocessing.

If the paper were about the ALIN matching system:
* The new ALIN metric should be explained in the main text, which would need a substantial rewriting.
* The whole ALIN pipeline should be explained, including using string-matching techniques, etc.

If the paper were generally about the lexical analyzer approach:
* It is applied to the human and mouse ontology matching pair, but a discussion of how to apply it to any ontology matching pair in general is needed.
* Throughout the paper, it is described that the lexical analyzer approach would save time for domain experts by reducing their involvement in the ontology matching validation step. However, the currently proposed iterative lexical analyzer approach is itself very time-consuming, and NLP experts are expected to be involved; the burden therefore moves from domain experts to NLP experts. It should be discussed how demanding this work is, and ideally it should also be tested with a couple of NLP experts in the experiment to determine whether it is doable.

Further remarks:
* The paper contains definitions of lexical aspects. However, I miss definitions of entities and entity names; i.e., what an ontology is and what its content consists of needs to be clarified.
* The related work section is overcrowded with references. Instead, it would help to add a summary table with the systems in rows and their lexical processing techniques (such as stemming, word separation, etc.) in the columns; individual cells would indicate whether a system applies the given technique.
* Generally, the description of the whole process should be improved: the paper is mainly about the lexical analyzer, but there is also a lexical analyzer generator. It would help to depict the whole process in pseudocode (a sketch of what this could look like follows this list).
* The workflow for NLP experts is described in Section 3.2, but it often lacks clarity. It is also quite demanding and time-consuming for NLP experts, e.g., "copy the lines and paste it into the lexical analyzer," "assess how to implement the standardization technique," "check to see if there is an existing program," "adjust the lexical analyzer." Moreover, it seems that steps 2 and 3 should be swapped. The section needs substantial rewriting.
* As I explained, I think the experimental setup is improper: either the evaluation or the first part of the paper should be changed. It is about evaluating ALIN, as written in Section 4.3 ("To evaluate ALIN"). However, based on the paper title and the first three sections, it should evaluate the lexical analyzer with regard to more matching systems and include an NLP expert's involvement in the experiment.
* The results comparing executions 4 and 5 are not very convincing: F-measure 0.941 vs. 0.953, with 357 vs. 405 interactions. We should also factor in the time-consuming work of the NLP experts. Although it is claimed that ALIN with the current lexical analyzers achieves better results than in OAEI 2023, the F-measure is the same (0.952), only with fewer total requests: 405 vs. 514.
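
Regarding the remark above on pseudocode, here is a minimal sketch of what the overall iterative process could look like; the replacement-rule map, the queue simulating the expert's proposals, and the single derivation rule are all assumptions standing in for the manual workflow of Section 3.2, not the authors' actual pipeline.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of the iterative standardization loop, not the
    // authors' actual implementation.
    public class IterativeStandardizationSketch {

        // A trivial "generated" lexical analyzer: applies replacement rules.
        static String standardize(String name, Map<String, String> rules) {
            String out = name.toLowerCase();
            for (Map.Entry<String, String> r : rules.entrySet()) {
                out = out.replace(r.getKey(), r.getValue());
            }
            return out;
        }

        public static void main(String[] args) {
            // An entity name that failed to match in a previous run
            // (illustrative), and its counterpart in the other ontology.
            String unresolved = "Enzymatic Activity";
            String target = "enzyme activity";

            // Rules an NLP expert might propose, one per iteration. In the
            // paper this step is manual; the queue merely simulates it here.
            Deque<Map.Entry<String, String>> expertProposals = new ArrayDeque<>();
            expertProposals.add(Map.entry("enzymatic", "enzyme")); // derivation rule

            Map<String, String> rules = new LinkedHashMap<>();
            String current = standardize(unresolved, rules);
            // Iterate: add a rule, regenerate the analyzer, rerun, until the
            // name matches or no proposal is left.
            while (!current.equals(target) && !expertProposals.isEmpty()) {
                Map.Entry<String, String> rule = expertProposals.poll();
                rules.put(rule.getKey(), rule.getValue());
                current = standardize(unresolved, rules);
            }
            System.out.println(unresolved + " -> " + current);
            // prints: Enzymatic Activity -> enzyme activity
        }
    }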

The paper is original, and the authors provided the code in the OSF repository equipped with a README file. It appears the provided resources enable reproducibility of the results from the paper.

Review #2
Anonymous submitted on 22/Jun/2024
Suggestion:
Minor Revision
Review Comment:

This paper introduces a novel approach for standardizing entity names in ontologies using a customized lexical analyser, improving the accuracy of entity matching. The method was evaluated using the ALIN system on human and mouse anatomy ontologies, showing significant improvement in alignment quality. This work is particularly important as it addresses the critical issue of inconsistent entity names, enhancing the overall effectiveness of ontology matching.

The paper is very interesting as it improves ontology matching results for the Anatomy track, which is one of the most well-known OAEI tracks used for evaluating ontology matching systems. After making the necessary revisions (outlined in the review), I believe the paper can be accepted for publication.

Minor
1) “Definition 3.7 (Term Variation). When two terms in a domain refer to the same concept, we say they are variants of the same term. A way to obtain a variation of a term is by replacing an adjective with a modifier noun. We find some of these variations in human and mouse ontologies. This type of variation is called derivation. For instance, ’enzymatic activity’ has the variation ’enzyme activity’. Another way to obtain a variation of the term is through permutation. Permutation can occur when a term contains prepositions. For example, ’activity of enzyme’ can have the variation ’enzyme activity’.”
Please remove the sentences from the definition:
“For instance, ’enzymatic activity’ has the variation ’enzyme activity’. Another way to obtain a variation of the term is through permutation. Permutation can occur when a term contains prepositions. For example, ’activity of enzyme’ can have the variation ’enzyme activity’.”
An example of term variation does not belong in the definition; place it in a regular paragraph.
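
As an aside, both variation mechanisms named in the definition are straightforward to illustrate in code; the sketch below is hypothetical (not taken from the paper), and its preposition handling and derivation lookup are deliberate simplifications.

    import java.util.Optional;

    // Illustrative sketch (not code from the paper) of the two variation
    // mechanisms named in Definition 3.7: permutation and derivation.
    public class TermVariationSketch {

        // Permutation: "X of Y" -> "Y X"; only the preposition "of" is
        // handled here, which is a simplifying assumption.
        static Optional<String> permute(String term) {
            String[] parts = term.split("\\s+of\\s+");
            if (parts.length == 2) {
                return Optional.of(parts[1] + " " + parts[0]);
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            System.out.println(permute("activity of enzyme").orElse("no variation"));
            // prints: enzyme activity

            // Derivation (adjective -> modifier noun) generally requires a
            // lexical resource; the single lookup below mirrors only the
            // paper's example.
            System.out.println("enzymatic activity".replace("enzymatic", "enzyme"));
            // prints: enzyme activity
        }
    }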

2) Do the same for all other definitions in the paper.

3) In my humble opinion, Figure 1 is a bit unclear at first glance and of questionable quality. Could you revise it so that it is immediately clear to the reader how the entire process works?

4) WordNet, Jaccard, Jaro-Winkler, etc. Please reference them in the literature (an illustration of why such string metrics are sensitive to name standardization follows below).
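
For illustration, here is a generic token-based Jaccard similarity; this is a textbook formulation under the assumption of whitespace tokenization, not ALIN's implementation, and it shows why the standardization proposed in the paper changes what these metrics report.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Generic token-based Jaccard similarity (textbook formulation, not
    // ALIN's implementation).
    public class JaccardExample {

        static double jaccard(String a, String b) {
            Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> inter = new HashSet<>(ta);
            inter.retainAll(tb);  // token intersection
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);     // token union
            return (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            // Before standardization the variants share only one token.
            System.out.println(jaccard("enzymatic activity", "enzyme activity")); // 0.333...
            // After standardizing "enzymatic" to "enzyme" they match exactly.
            System.out.println(jaccard("enzyme activity", "enzyme activity"));    // 1.0
        }
    }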

5) In Table 2, it is stated that ALIN from 2023 and your new version of ALIN proposed in this paper have the same results. Is this a mistake? On the official OAEI 2023 website (http://oaei.ontologymatching.org/2023/results/anatomy/index.html), ALIN achieved the following results:
Precision 0.984
F-measure 0.852
Recall 0.752
This is different from what is stated in Table 2. Please correct this.

6) Something to think about: did ALIN with the proposed lexical analyser and ALIN metric have an advantage over the other systems in the Anatomy track, given that the lexical analyser was created with an NLP expert in an iterative process and no other system could have used it, since it did not exist before? In my humble opinion, I would not compare ALIN with other matching systems in Table 2, but rather compare ALIN with the proposed lexical analyser and ALIN metric against ALIN from 2023. Please revise this.

Review #3
Anonymous submitted on 16/Jul/2024
Suggestion:
Reject
Review Comment:

High-level assessment of dimensions for research contributions as per https://www.semantic-web-journal.net/authors:
(1) originality:
The paper presents a novel mechanism of standardizing terms across ontologies given a matching task. There are a number of original ideas presented in this paper and I find that reasonable for a journal paper on this topic.
(2) significance of the results:
I have serious doubts about the significance of the work and the results. The results are only on two ontologies, and as Table 2 shows, the accuracy is the same as a state-of-the-art solution from 2021 with more than double the number of requests. A high-quality journal article needs more than one pair of ontologies to properly evaluate a hypothesis.
(3) quality of writing:
The paper's writing can be improved in many ways. The basic English needs improvement: the paper has many short one- or two-sentence paragraphs, and there are errors like "Section 2, we present...", "related works", and large numbers of citations placed in the middle of a sentence. Although these mostly do not affect readability, the paper could have better flow and clarity in most parts. There are a few examples here and there, but the paper would benefit from one or two very clear running/motivating examples: how would one perform the matching for an example scenario with state-of-the-art solutions, and how would one do it with your solution? In Figure 1, what is the starting point? When would the process end? Figures 2 and 3 can be replaced with tables; you do not need a large 3D chart for these, and the space could instead be used to improve readability. I also had a very hard time following the contents of the Appendices; the Java code in Appendix II is very strange in my view.

> Please also assess the data file provided by the authors under “Long-term stable URL for resources”.
The provided link https://osf.io/gc6jm/ gives me an access denied error.

Apart from the above issues, I do not see machine learning of any kind in this paper, which makes it unfit for the special issue. Even ignoring the topic of the special issue, a paper on this subject needs at least a discussion of why ML models and Large Language Models (LLMs) cannot solve the task at hand.

Overall, despite the merits of the work and some very interesting ideas and promising results, the paper needs a major revision addressing the presentation and technical issues to make it suitable for a regular issue of the journal.