Automatic evaluation of complex alignment: a query-oriented approach

Tracking #: 2137-3350

Elodie Thieblin
Ollivier Haemmerlé
Cassia Trojahn dos Santos

Responsible editor: Jens Lehmann

Submission type: Full Paper

Ontology matching is the task of generating a set of correspondences (i.e., an alignment) between the entities of different ontologies. While most efforts on alignment evaluation have been dedicated to the evaluation of simple alignments, the emergence of complex approaches requires new strategies for addressing the hard problem of automatically evaluating complex alignments. This paper proposes a benchmark for complex alignment evaluation composed of an automatic evaluation system that relies on queries and instances, and a populated dataset about conference organisation with a set of associated competency questions for alignment as SPARQL queries. State-of-the-art alignments are evaluated and a comprehensive discussion on the difficulties of the evaluation task is provided.
Major Revision

Solicited Reviews:
Review #1
By Pavel Shvaiko submitted on 24/Jun/2019
Major Revision
Review Comment:

The submission addresses the topic of automatic evaluation of complex alignments, which is underdeveloped and worth further investigation. Overall, the manuscript represents an incremental improvement over the state of the art, and in its current form it does not justify publication in SWJ.

The objectives of the submission are clear and backed up with clearly identified contributions and a motivation example, which is positive.

The overall argumentation flow is smooth. However, Section 2 is too short and can be merged without harm into Section 3. Also, redistributing the relevant parts of the example in Section 5.5 across the corresponding previous subsections (5.1-5.4) would make the presentation more effective.

Related work is presented with relevant breadth, but it is inconclusive with respect to the proposed approach; a direct comparison of the proposed work with the state of the art would make the discussion more to the point.

On p.7 the authors state "The comparison of the objects can be performed in a syntactic, semantic or instance-based way". It is insufficiently justified why exactly these dimensions are considered, so they appear rather ad hoc. This undermines the completeness of the approach.

The statement on p.9 that "the equivalence relation is considered more specific than the subsumption relations" appears counter-intuitive and contradicts set theory, and thus undermines the soundness of the approach.

On p.11 the authors state that two rewriting systems were considered. The first, from [34], is earlier work by the authors. The second system, which is based on instances (p.12), is not referenced, and it is not clear whether it is a guess at how things could work or whether an actual implementation exists.

On p.12 the authors state: "So far, no rewriting system dealing with (m:n) correspondences has been proposed in the literature." This statement is too strong; see, e.g., the work by:
- Weifeng Su, Jiying Wang, and Frederick Lochovsky. Holistic schema matching for web query interfaces. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT), 2006.
- Songmao Zhang and Olivier Bodenreider. Experience in aligning anatomical ontologies. International Journal on Semantic Web and Information Systems, 2007.

The proposed dataset assumes a consistent population of the ontologies, which the manuscript addresses through artificial populations. Are there any practical applications where such an approach could be deemed realistic? A discussion of the limitations of the approach is needed (to strengthen its practical significance), along with how they could be overcome, e.g., as part of future work.

English is sometimes shaky; see, e.g., the first column of p.4:
- “Which are the accepted paper?” -> paperS
- “What is the decision of a paper?” -> ON a paper
- “What is the rate associated with the review of a given paper?” -> DECISION associated
- These evaluation -> evaluationS
- As in [23] who evaluate -> evaluateS

Review #2
Anonymous submitted on 18/Jul/2019
Major Revision
Review Comment:

This paper proposes a benchmark for complex alignment evaluation composed of an automatic evaluation system that relies on queries and instances, and a populated dataset about conference organization with a set of associated competency questions for alignment as SPARQL queries.

The work is very relevant, and producing benchmarks is a much-needed contribution to allow progress in the field of complex matching.

The work is sound and the paper is well written; however, a number of topics need to be clarified to support a better understanding of the work. I have organized my comments around the main topics that need to be addressed. The main issues are:

1. There should be a better discussion of the limitations/impact of query rewriting systems. Empirical results could be shown comparing the two documented approaches.
2. The paper addresses both reference alignment and reference query-based evaluations. There are important distinctions between the two that are not always clear throughout the paper.
3. The paper puts forward an evaluation system based on instance data, but it lacks a more thorough discussion of the desiderata for instance data to support complex matching.
4. Evaluation based on CQAs may be unintentionally skewed. There is a body of work on complex matching based on patterns, and existing systems (also used in the paper) use this approach. If the CQAs have different coverage for mappings achieved through different patterns, this may have an impact on evaluation. Acknowledge and discuss.
5. I was surprised by the lack of future work.

1. Query rewriting systems

The paper describes the employed query-rewriting systems/approaches only lightly. The authors cite their previous work [34], but this presents only one of the approaches.

1.1 Both approaches need to be described in more detail

1.1.1 On p.12: "This rewriting system cannot, however, work the other way around. For example, the CQA {?s cmt:hasDecision ?o. ?o a cmt:Acceptance.} cannot be rewritten with c11."

This is not clear to me. Why can't it be rewritten the other way around? Is it a feature of the rewriting system?
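To make my question concrete, here is a minimal sketch (my own hypothetical data and code, not the authors' implementation) of why such a rewriter could be one-directional: if each correspondence maps a single source triple pattern to a possibly compound target pattern, the forward rewrite is a per-triple substitution, while the reverse direction would require recognizing the whole compound pattern inside a query, i.e. subgraph matching. All entity names (conf:accepted, cmt:hasDecision, etc.) are illustrative.

```python
# A correspondence: ONE source triple pattern -> a list of target triple
# patterns. Names are illustrative, not taken from the evaluated ontologies.
c11 = {
    "source": ("?p", "conf:accepted", "true"),
    "target": [("?p", "cmt:hasDecision", "?d"),
               ("?d", "rdf:type", "cmt:Acceptance")],
}

def rewrite(query_triples, correspondences):
    """Replace each query triple whose predicate matches a correspondence's
    single source triple with that correspondence's target pattern."""
    out = []
    for t in query_triples:
        for c in correspondences:
            if t[1] == c["source"][1]:  # match on the predicate only
                # carry the query's subject variable into the target pattern
                out.extend([(t[0],) + tt[1:] if tt[0] == c["source"][0] else tt
                            for tt in c["target"]])
                break
        else:
            out.append(t)
    return out

# Forward direction works: a single triple is found and expanded.
q = [("?s", "conf:accepted", "true")]
print(rewrite(q, [c11]))
# The reverse direction would have to detect BOTH target triples of c11
# inside a query at once -- a subgraph match this per-triple loop cannot do.
```

If the directionality in the paper stems from a restriction like this, stating it explicitly would answer the question.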

1.1.2 The proposed query rewriting system is not very clearly presented. "It can deal with (m:n) correspondences but cannot combine correspondences in the rewriting process." Can you explain this more clearly?

1.2 Why use two query-rewriting approaches? Are there any advantages of the one from [20] over the new one presented here?

1.3. Although the first approach was evaluated in [34], the approach proposed here is not. Given the impact that the rewriting approach can have on the validity of the proposed evaluation method, a standalone evaluation of the proposed query-rewriting method is needed.

1.4. In 5.5.2: "the anchoring phase is strongly dependent on the employed rewriting system". After reading this, I was hoping to see a study comparing the impact of the two different query-rewriting systems.

2. Automated evaluation

2.1. Is there a difference between Anchor selection and Comparison when using a reference alignment?

2.2 In 5.1: "In the case of reference queries, the anchoring phase consists in translating a source query based on the evaluated [...]". How do you then select which pairs of queries are anchors? I found this paragraph very confusing.

2.3. I find the paper lacks a clearer statement of the conceptual and practical difference between Comparison and Scoring.

2.4. In 5.2, I would appreciate a clarification of "In the case of queries, relations and confidence comparison may be not expressed and then not used in the comparison step." If they may not be expressed, how are they expressed when they are?

2.5. In 5.3: "There really is no "best scoring function" or "best metric". It all depends on what the evaluation process is supposed to measure."

Could you give examples of different evaluation goals?

2.6 In 5.5, the example with D1 and D2 illustrates a limitation of instance-based evaluation. The paper would be improved by a more thorough discussion of the desiderata for instance data to support complex matching.

2.7 In 5.5.1: "The pair scores considered in this step are: score(c11, cr1) = 1, score(c12, cr1) = 0.5, score(c21, cr2) = 0.2. As no evaluated correspondence ci was paired with more than one reference crj, no evaluated correspondence aggregation needs to be performed."

Which of the strategies was used to compute these scores?

2.8 It would also be nice to get the scores for the example in 5.5.2.

2.9 In 6.1: "None of these systems consider the correspondence relation or correspondence value." How impactful is this?

2.10. In 6.2 the concept of instance-based precision is introduced. Couldn't recall also be computed?
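To illustrate what I have in mind: assuming each correspondence member can be evaluated to a set of instances over the commonly populated dataset (my assumption, not a claim about the authors' implementation), recall would simply be the symmetric counterpart of the instance-based precision the paper defines. Function names and instance identifiers below are hypothetical.

```python
# Hypothetical sketch: given the instance sets retrieved by the source and
# target members of a correspondence over the common population, precision
# and recall are the two symmetric overlap ratios.

def instance_precision(source_insts, target_insts):
    # fraction of the target member's instances also covered by the source
    if not target_insts:
        return 0.0
    return len(source_insts & target_insts) / len(target_insts)

def instance_recall(source_insts, target_insts):
    # fraction of the source member's instances recovered by the target
    if not source_insts:
        return 0.0
    return len(source_insts & target_insts) / len(source_insts)

src = {"paper1", "paper2", "paper3", "paper4"}
tgt = {"paper2", "paper3", "paper5"}
print(instance_precision(src, tgt))  # 2/3
print(instance_recall(src, tgt))     # 2/4 = 0.5
```

If recall is omitted for a principled reason (e.g. incompleteness of the instance population on the target side), that reason should be stated.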

3. Performance metrics

3.1. In 5.5.2, an example of the computation of scoring using queries should be given.

3.2. Section 5.3 discusses scoring. This is a highly complex subject in complex matching. I believe the paper should provide a stronger motivation for the need for scoring metrics beyond classical precision and recall.

3.3. On p.12, in 6.1.3: "The query F-measure was preferred over other metrics to be the scoring function. Indeed, it represents how well the evaluated query suits the user needs in comparison to the reference one."

This assumes both kinds of errors are equally important for users' needs, which may not be the case. In fact, in semi-automated applications of complex matching, relaxing precision to increase recall may make sense, since it may be easier for the user to filter out incorrect mappings than to perform exhaustive searches over both ontologies to produce the missing ones. As a general-purpose scoring function, I have no problem with F-measure, but the statement should make this assumption clear.
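The standard F-beta generalization makes this point concrete (this is textbook material, not the paper's scoring function): beta = 1 weights precision and recall equally, while beta > 1 favors recall, which suits the semi-automated setting described above. The example values are arbitrary.

```python
# F-beta: the usual generalization of F-measure. beta = 1 recovers F1;
# beta > 1 weights recall more heavily than precision.

def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.9  # a high-recall, lower-precision evaluated query
print(f_beta(p, r))          # F1 = 0.72
print(f_beta(p, r, beta=2))  # F2 rewards the recall-oriented query more
```

Stating which beta the benchmark uses (and why) would address the concern.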

4. Dataset and Evaluation

4.1. On p.13, 7.1: how did you ensure that having a single researcher create the CQAs does not bias the validity of the results? Was a set of criteria followed when creating the first set of CQAs?

4.2. I am unfamiliar with the term "pivot format" as it is used throughout the paper. I believe a pivot format describes a data format that can bridge two heterogeneous ones, and I am unsure how this relates to the artifact mentioned in the paper.

4.3 Although the coverage of the ontologies over the CQAs was assessed, the coverage and complexity of the CQAs over the ontologies, i.e. how many entities of the ontologies are covered by the CQAs, apparently were not. This may be relevant for precision evaluation: if a system creates valid mappings for areas of the ontology not covered by the CQAs, this may have a negative impact on its measured performance.

4.4. In 7.3.1 the process of refining the CQA list is described. It would be great to have some numbers on this, detailing the original number of CQAs and how it changed in subsequent steps of the process.

4.5 Five different alignments were used in the evaluation. They are not described at all, and the differences between them are not always taken into account when discussing the results. A short description of how the alignments were obtained (except ra1) would help the reader understand the results without having to consult three different papers.

5. Others

In 4.2 it is not clear how the Hydrography and GeoLink evaluation is conducted. Is it manual or automated?

Typos etc

pp1 "However, simple correspondences are not fully enough"
pp6: "the focus is done on how this" -> the focus is on how this

pp13 "Based SPARQL INSERT" -> Based on SPARQL INSERT

pp14 "it has been partially populated." -> it has only been partially populated.

pp15 "The idea is to provide the same conference ontologies but with more or less common instances." This sentence does not read well, maybe use "partially overlapping set of instances"?

pp16 Another sentence that could be improved: "All the ontology concepts were not covered by the pivot CQAs." should be: Not all the ontology concepts were covered by the pivot CQAs.

pp16 (7.5) "a few things are needed" - two things are needed

The examples of D1 and D2 on page 2 could be given in a figure or table so that they stand out more. They are used throughout the paper and are not easy to find.

Review #3
Anonymous submitted on 22/Sep/2019
Review Comment:

This paper addresses the problem of automatic evaluation of complex alignment by proposing a benchmark composed of an automatic evaluation system that relies on queries and instances, and a populated dataset about conference organisation with a set of associated competency questions for alignment as SPARQL queries. The authors cover two aspects in the system. The first idea is to evaluate the alignment based on the coverage of the CQAs. The second is precision evaluation of alignments based on common instances.

Comments on the three dimensions for research contributions:
(1) originality: In my view, the quality of the solution, its impact, and the thorough experiments with very promising results make this an original contribution to the field.
(2) significance of the results: The results are significant and have the potential to make a major impact on the state of the art in evaluation of complex alignment.
(3) quality of writing: The paper is very well written. It presents a hypothesis and evaluates it very well. I found it relatively easy to understand and follow, in part thanks to the good and real examples used throughout the paper. It is also positioned reasonably well in the literature.

Given the above, I find the paper acceptable as is.