Review Comment:
The paper presents a solution, and its implementation in an actual evaluation, for evaluating complex alignments resulting from ontology matching. It is mostly based on providing populated datasets on which the extensions of concepts and the results of query evaluation may be compared. This, in principle, avoids dealing with syntactic or semantic comparison of concept descriptions or queries.
This is a relevant topic that has not been much investigated, due to the lack of matchers outputting such alignments and the difficulty of extending existing approaches to complex alignments. Now the maturity of the technology (matchers) and the need for complex alignments make this work necessary.
The paper gives a clear view of the difficulty of the task and presents different ways this can be approached. It describes a significant amount of work, carried out openly, towards evaluating complex alignments.
I however see an outstanding problem in the paper: the analysis carried out in Section 4.1 is highly arguable. It presents a 'Generic workflow' for evaluation against a reference, which involves finding items that are individually compared and scored. This is indeed a general method that is widely used, and it raises problems when dealing with complex alignments.
However, this is not the only method. The work on semantic precision and recall performs evaluation against references without needing to decide what to compare to what: it is only necessary to evaluate whether the reference entails each correspondence of the result (precision) and whether the result entails each correspondence of the reference (recall). Such a mechanism has been implemented, used in OAEI, and can be routinely exploited.
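Spelled out (in my own notation, not the paper's), for an alignment A evaluated against a reference R this amounts to:

    P(A, R) = |\{ c \in A : R \models c \}| / |A|    (semantic precision)
    R(A, R) = |\{ c \in R : A \models c \}| / |R|    (semantic recall)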
Concerning complex alignments, it has the advantage that any item that can be converted into OWL can be practically evaluated in this way (EDOAL offers such conversions and HermiT can compute entailment). For constructs going beyond OWL (typically value transformations), the notion of entailment of such constructs would need to be defined and implemented. This is a limitation, but it does not justify silently ignoring the approach.
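As an illustration only (a minimal sketch, not the paper's method: the file name, IRIs, and the choice of a simple subclass axiom are purely hypothetical, and the reference alignment is assumed to have already been translated into OWL axioms and merged with both ontologies, e.g. via an EDOAL export), such an entailment check could look as follows with HermiT and the OWL API:

    import java.io.File;

    import org.semanticweb.HermiT.ReasonerFactory;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.IRI;
    import org.semanticweb.owlapi.model.OWLAxiom;
    import org.semanticweb.owlapi.model.OWLClass;
    import org.semanticweb.owlapi.model.OWLDataFactory;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.model.OWLOntologyCreationException;
    import org.semanticweb.owlapi.model.OWLOntologyManager;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class EntailmentCheck {
        public static void main(String[] args) throws OWLOntologyCreationException {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            OWLDataFactory factory = manager.getOWLDataFactory();

            // Hypothetical file: both ontologies merged with the reference alignment,
            // the latter translated into OWL axioms (e.g. via an EDOAL export).
            OWLOntology reference = manager.loadOntologyFromOntologyDocument(
                    new File("merged-with-reference-alignment.owl"));

            // Hypothetical correspondence taken from the evaluated alignment,
            // already translated into an OWL axiom (here a plain subclass axiom;
            // complex correspondences yield class expressions instead of named classes).
            OWLClass source = factory.getOWLClass(IRI.create("http://example.org/onto1#AcceptedPaper"));
            OWLClass target = factory.getOWLClass(IRI.create("http://example.org/onto2#Paper"));
            OWLAxiom correspondence = factory.getOWLSubClassOfAxiom(source, target);

            // HermiT decides whether the reference entails the correspondence.
            OWLReasoner reasoner = new ReasonerFactory().createReasoner(reference);
            System.out.println("Entailed by the reference: " + reasoner.isEntailed(correspondence));
        }
    }

Repeating this check over every correspondence of the evaluated alignment (for precision) and of the reference (for recall) is all the semantic variant requires; no anchor selection or scoring function is involved.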
This is actually mentioned briefly in the conclusion (as a semantic-oriented approach), but only to discuss its limitations, overlooking its actual benefits.
This comment does not aim to diminish the merits of the proposed approach, but to call for a fair treatment and discussion of the actual alternatives on the table, which do not suffer from drawbacks such as 'The variety of scoring function and all their possible combinations', especially given that 'anchor selection and comparison are the most difficult steps to automate'.
There is another approach used in this paper that does not have this drawback: using reference queries instead of reference alignments. It is noteworthy that this also admits a semantic treatment, using query containment of the transformed queries instead of the comparison of query results. This is, however, not mentioned in the paper.
There is, however, another way to resolve this issue: to make clear that this kind of evaluation is ruled out by the second part of the paper title, 'an instance-based approach'. Indeed, all semantic approaches (as in semantic precision and recall or in query containment) are not instance-based, since they hold for any ontology population and not for a specific one.
But this should be better put forward in the paper, discussed and justified.
In particular, it raises the problem of dependency on the ontology populations.
The positioning part could be more explicit (it is difficult to see what the definitive difference is when reading it), and then it should be clear that there is no reason to discuss other efforts and to carry these out all along Section 4.
Given this general remark, my suggestion oscillates between major and minor revision:
On the one hand, this remark calls for more than minor revision.
On the other hand, although it requires some reorganisation of the paper, it is not difficult to implement and does not change its main content and contribution.
In addition, it seems strange to send a paper that received 'Minor revisions' back to a 'Major revision' stage.
Hence, I resolved to Minor revision, leaving the decision in the hands of the editor (as usual, but more than ever).
Another subject of improvement:
The paper should make clearer the difference between the _approach_ (with CQAs, for instance), which could be applied to simple alignments as well, and the _benchmark_, which comprises datasets, possibly populated, possible references, and evaluation modalities.
Details:
p1: emergence of complex approaches -> emergence of matchers providing complex alignments
the basis for a range of other tasks -> the basis for other tasks
there is still a lack benchmarks -> benchmarks are still lacking
p2, l12: these datasets -> these cases
as for the CQA: CQA has not been introduced before, AFAICT
p3: will be referred as _A_ ''complex matching approach'', _A_ ''complex matching system''...
- 2.2: it is unclear to me what is meant by the two occurrences of 'in the best case'.
p4: n-ary of size 3 or more! n-ary means of size n. You may consider variable-arity queries.
- Tool-oriented. It seems to me that 'Resource-oriented' would be better.
- There is a misunderstanding about the Task-oriented evaluations. Such evaluations have not been set up for evaluating task-oriented matchers (constructed for a given application or with a given task in mind), but for evaluating the suitability of general-purpose matchers in a given task (so not because matcher designers had that task in mind, but because application designers have a task to carry out). It is because such matchers can deliver standardised output that it is possible to replace one with another in a specific task-oriented test.
p5: (taxon) 'at least a': it is not clear so far that several rewritings may be obtained.
- 3.3: 'with respect to model answer set': it is unclear what is meant, and the following 'i.e.' does not seem to explain it.
p6: pertinent -> relevant
we proposed here -> we propose here (or we proposed there)
p11: prefer equivalence TO subsumption
p15: prefer one score than other -> prefer one score than/to ANother
- The decision that, if a CQA cannot be rewritten, its query precision and recall are set to 0 is arguable. That is no big deal, but if these have to be evaluated, then precision could be set to 1, because no mistake has been made.
p16: 6.1 (6): 'an exception occurs': what do you mean? A NullPointerException? Isn't it rather an inconsistency that is detected?
- 'change the interpretation of the ontology': how do you do that? Is the term legitimate? Actually, what is done is to add/suppress/modify axioms in one ontology. From a logical standpoint, one does not change interpretations but the theory, which then accepts different models. This 'interpretation' term can be understood, but it can also be misleading. Here it seems that
p17: 6.3.1: same as before with exception (twice)
Then the 'SPARQL INSERT queries have been modified in order to fit the new interpretation of the ontology'
It would be worth noting that the process described here may also be exploited to curate/improve ontologies.
p18: 'pseudo-determines': imprecise, what is meant by that?
p20: 'However, given this score seems low for a reference alignment': what does this mean?
p21, Table 12: four digits for precision and recall do not seem really significant given the size of the alignments.
p22: an overlap is -> if ?
last paragraph: repetition of 'however' in two consecutive sentences.
p23: rewriting systemS are limited TO (s:c) correspondences
While it had to be done, we reduce -> While THIS had to be done, we reduceD