Survey on complex ontology matching

Tracking #: 2045-3258

Authors: 
Elodie Thieblin
Ollivier Haemmerlé
Nathalie Hernandez
Cassia Trojahn dos Santos

Responsible editor: 
Marta Sabou

Submission type: 
Survey Article
Abstract: 
Simple ontology alignments, largely studied in the literature, link a single entity of a source ontology to a single entity of a target ontology. One of the limitations of these alignments is, however, their lack of expressiveness which can be overcome by complex alignments. While diverse state-of-the-art surveys mainly review the matching approaches in general, to the best of our knowledge, there is no study about the specificities of the complex matching problem. In this paper, an overview of the different complex matching approaches is provided. It proposes a classification of the complex matching approaches based on their specificities (i.e., type of correspondences, guiding structure). The evaluation aspects and the limitations of these approaches are also discussed. Insights for future work in the field are provided.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Catia Pesquita submitted on 30/Nov/2018
Suggestion:
Accept
Review Comment:

The authors have addressed all of my comments satisfactorily.

I found three minor typos:
1. In "While some approaches [56, 98, 99] rely on a semantic tree derived from the schema The approaches" - "The" should be replaced by ", the"
2. In "Genetic programming is a way of finding complex correspondences between data properties" - "is a way of" should be replaced by "can be used for "
3. In "apply the genetic algorithm to alignments as its 'individuals'" - "the" should be replaced by "a"

Review #2
Anonymous submitted on 31/Dec/2018
Suggestion:
Minor Revision
Review Comment:

This paper has benefited from a thorough re-writing. I am very pleased that many of my comments and suggestions were addressed.

There are however some issues that remain: some that I believe have not been answered in the most appropriate way, and some introduced by the new additions or removals. Details can be found below.

Overall, however, I must clarify that the paper is now closer to being fit for acceptance. Probably close enough so that I wouldn't want to be involved in a next review step. Especially, I don't want to hold the authors' pen for writing correct and clear sentences. I must warn the authors here that while the paper is correct in explaining the rather complex matter it sets out to tackle, it is far from being a pleasure to read. The reader will have to do a lot of parsing to understand the intention behind some sentences that are probably badly translated from French. I am still puzzled to find some wording issues such as the super-awkward last paragraph of p29 (repetition of "in complex matching", twice "related to" in ways that just do not fit in the same sentence, wrong use of "however") and the many passive forms, which sometimes make sentences wrong from a grammar standpoint (e.g. "in the following, are provided" on p3). This hints that *every author* should proof-read the paper before it is submitted.

On specific comments I have already made (of course I am not repeating comments that were successfully addressed!):

[
Another puzzling point is the position of the survey with respect to ontology evolution. It is listed as a possible case of application of complex alignments on p1, but then is presented as being out-of-scope for the survey, without much explanation.
]

I appreciate that the new 2.4.2 gives explanations that address my worries. However, is so much explanation (and so many references) required for something that is out of scope? Can't this be shortened a bit? This issue is not specific to *complex* ontology matching per se.
Furthermore, the title of 2.4 'scope definition' should perhaps be 'scope clarification', as the section does not define the scope by itself, it merely makes it more precise.

[
the order of the tools analyzed should be more consistent across table. The currently (semi) random ordering does not help the reader.
]

The situation is better now, but I am still puzzled why tables 6-8 are primarily structured by the type of output, while section 5 is primarily structured by the 'guiding structure'. This should be homogenised, as section 5 is what readers will see to be the substrate of tables 6-8.

[
p23-24: I am not sure the paper needs an independent conclusion as it is currently written. It could perhaps be merged with section 6.
]

I am surprised to see that a lot of the material that was previously in the discussion section has been removed. Maybe this was requested by other reviewers. But I would like to clarify: I had seen some minor issues with the arguments made in that section, but I was not arguing for such a drastic reduction. A mere merging of sections 5 and 6 (in the first submitted version) would have been enough for me.

[
p16: Lagramge -> LAGRAMGE.
]
Now that I've checked the original reference [73] I realize that the algorithm is not a reference to Lagrange, but to a specific algorithm developed by the authors of [73]. So I am re-writing my suggestion. Using all-caps would help disambiguate this reference, and at the same time follow the way the authors of [73] write about their system!

New comments:

- in the entire paper, the symbols for the "more general" and "more specific" links read well in the PDF, but for some reason they printed badly on paper. This is probably an issue on my side, but a double-checking may be useful.

- p1: "on the Linked Open Data (LOD)" probably misses "cloud" or maybe it should be "as Linked Open Data". LOD is a technological paradigm rather than a set of datasets.

- p2: after looking at the paper again, I am not sure why the reader should be presented with diagrams such as fig 1. They are not easy to decipher for those who are not familiar with [17], and the convention is used only once in the entire paper (in fig. 4), making the effort not much worth it.

- I have seen that the confidence measure has been removed from the definition of correspondences. This seems to be a response to another reviewer comment. But I am puzzled: are confidence measures used in none of the studied matchers, really? If some use them, then it's worth adding them back in the definition, and explaining in the text that confidence is used in some matchers, even though it's not explicitly (and formally) noted in this paper's text.

- p3: the definition of ontology matching is not worth a formal definition (definition 3) imo. The one paragraph that explained it in the previous version (at the beginning of section 2.3) was better.

- p4: I'd argue that it is wrong to state that "business process models [...] do not model knowledge related to a particular domain" as 2.4.1 hints (but maybe it's just a matter of wording hard to understand for me).

- p5: DDL and DFOL are not explained

- section 3.1 has some homogeneity problems: in 3.1.2 it is not clear why there is a mention of a specific syntax for SWRL and not the other vocabularies. There should be none, or there should be a syntax (or syntaxes) mentioned for every vocabulary. At another level, the authors classify some languages ("DB to Ontology, Logic, Transformation" for R2RML) but it's not clear why they apply this only to some languages and not everywhere in section 3.1. And it does not seem consistent with table 1. Plus, it's actually unclear what the classification means. Do "logic" and "transformation" refer to the "logical relations" and "transformation functions" more formally introduced only later in 4.2?
Likewise, the 'targeted applications' of table 1 do not seem to be introduced in an explicit way. Do they come from existing work (which could be the case, as they are not specific to complex matching)?

- p6: the last paragraph of 3.1.5 does not read to be appropriate at this level in the paper, as it's about quite a general convention.

- p11: in table 3 shouldn't o2:reviewWrittenBy(y,z) be o2:reviewWrittenBy(z,y)?

- p22: I don't see why the general intro for KARMA should be repeated (it's already on p20)

- p28: some of the analysis is still probably right, but it reads strangely now that the state-of-the-art on various individual evaluations has been removed!

- p29: ODBA should be explained

Review #3
By Antoine Zimmermann submitted on 22/Jan/2019
Suggestion:
Major Revision
Review Comment:

The new version of the paper addresses many of the issues identified by the reviewers in the first round of evaluation, including many of my own. I think that this work should eventually be published. However, some issues unfortunately remain unsolved, in spite of having been identified previously. The main points are the following:

1) As I suggested previously, a section on alignment representation has been added. This is new material not previously reviewed. I find this part to be rather disorganised. As an example, the section puts together logics that are not alignment representation languages, with dedicated alignment formats, and labels all of them "vocabularies".
2) The syntax and structure of correspondences used throughout the paper still don't follow the definition. Some correspondences are not even looking like FOL anymore (e.g., c2 in Sec.2.3). A consequence of this is that it becomes difficult to understand what a complex correspondence is. With the definition of correspondence as a triple (e1,e2,r), a correspondence like (o1:C,\exists o2:R.\top,\equiv) is definitely complex. In FOL-like syntax, it becomes (using the same notation as the authors) \forall o1:C(x) \equiv \exists y o2:R(x,y). This seems to satisfy the definition of simple correspondence, as given in Sec.2.3, since it only involves atoms on each side of the equivalence.
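The contrast described here can be made explicit side by side; as a sketch, using the reviewer's own notation (not the paper's):

```latex
% Triple form: the second member is a complex DL expression,
% so the correspondence is unambiguously complex.
(o_1{:}C,\ \exists\, o_2{:}R.\top,\ \equiv)

% FOL-like rendering of the same correspondence: each side of the
% equivalence is a single atom, so it superficially matches the
% paper's definition of a *simple* correspondence.
\forall x\ \big(o_1{:}C(x) \leftrightarrow \exists y\ o_2{:}R(x,y)\big)
```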
3) What structures are being aligned by the various matchers and matching techniques are still vague and ambiguous. What an ontology is according to the survey is still unclear. On the one hand, the term "ontology" is said to be used, "in this survey", in a broad sense; on the other hand, it seems to exclude database schemas, XML schemas, taxonomies. Not knowing the exact structures or languages or logics on which the referenced techniques work is a real limitation of the survey. Whether or not the referenced papers define the term ontology explicitly is irrelevant. If the techniques can be used to provide concrete results, it means that they implicitly or explicitly assume a certain input structure. It is the role of the surveyors to lift ambiguities and make the implicit explicit, from the surveyed articles.
4) other little things, either present in the original version and not noticed before, or that have not been changed in spite of being mentioned in previous reviews, or that have been added in the new version, in accumulation, make the required changes substantial enough to request another major revision.

All changes made to the paper should be documented in the response to reviewers, or highlighted in the new version, e.g., by putting the added/modified text in red. After realising that some text changed or moved without mention in the response to reviewers, I undertook a thorough comparison of the two versions, which led me to take note of every problem I found in the paper. This explains my overly long late review. I'm sorry if the comments at times seem a bit too zealous, but since I recorded these notes, I think they are more useful if sent to the authors in full.

In their response to reviewers, the authors justify multiple times their prose by arguing that this is how something is explained in the reviewed papers. Namely: "The authors wrote Conceptual models ..." [in response to Reviewer 1 to justify the use of the phrase "Conceptual models"]; "Fig.2 was adapted from a work in [4], which was also quoted in [2]" [to justify the figure]; "few papers [...] give a definition of ``ontology''" [to justify the use of the term "ontology" in a really fuzzy way]; "it is the notation used by the authors" [to justify the use of strange notation].
--> a survey should not rephrase what the surveyed papers say. It should (re)interpret what the authors of said papers mean, translating the heterogeneity of the individual contributions to a unified framework for analysis and comparison.

Detailed comments:

Introduction:
- "The aim of this survey is to provide an overview" -> I don't think a paper can be an overview and a survey at the same time. As a survey, it can provide an overview, but this should not be the "aim" of the paper.
- "over-viewed" -> "reviewed"

Background:
- Sec.2.1:
* "(e.g., Database schema ...)" -> "database schema"
* In their response to the reviews, the authors point out that they take inspiration from Uschold & Gruninger 2004, as well as Euzenat & Shvaiko 2013. Both references provide a figure similar to Fig.2, but more precise and better explained in the surrounding text. From what I understand, the figure is not needed and not useful for this survey. It suffices to say that the term "ontology" is sometimes used by other authors in a broad sense to include taxonomies, database schemas (especially relational database schemas), XML schemas, and formal ontologies. In this survey, ontologies are distinguished from the other stuff and should be understood as axiomatised theories (if I understand it well?)
- Sec.2.2, 1st paragraph:
* rephrase to active form "Before ..., we give the definition of ..."
* "\forall x, o2:accepted(x,true)" cannot be the expression of a member of a correspondence because the quantifier must not be scoped to the member. See remark below on Def.1 (already mentioned in my previous review). A FOL expression "\forall x A(x) \to B(x)" cannot be interpreted as a triple (\forall x A(x),B(x),\to).
- Sec.2.3:
* the section immediately starts with a definition, without any introductory text (which was not the case in the 1st version). There is certainly no need to put the notion of "ontology matching" in a "Definition environment". The section could just point out that "Ontology matching is the process of generating an ontology alignment between a source and a target ontology. An ontology alignment consists of correspondences. We formally define these notions here." (or something along these lines)
* Although Def.1 now avoids using an explicit triple (e1,e2,r), it still defines a correspondence as being composed of 3 pieces. The rest of the paper still fails to follow this definition in its examples of correspondences. If the authors prefer to stick to the pseudo FOL notation, then they must change Definition 1, or even get rid of the formal definitions altogether. Instead, an informal intuition of what the FOL-like formulas mean could be given. In any case, whatever formal definition is used, the paper should stick to it.
Note (again) that the symbol \equiv is typically not used in FOL formulas. What should be used instead is \leftrightarrow, or \Leftrightarrow. The symbol \equiv is typically used as a meta logical symbol for the logical equivalence of two formulas (as I explained already in my previous review). It is also used in description logics. From the given definition of correspondence and the use of \equiv, an inattentive reader may think that \forall x AcceptedPaper(x) is one member of a correspondence, which it definitely isn't. The symbols \leq and \geq are used here to mean "more general" and "more specific", but the paper only uses these symbols in FOL formulas in place of implication. So having \to (implication) as well is bizarre.
* "are provided some examples" -> "we provide some examples"
* "and the motivating example ontologies (Figure 1)" -> these examples are introduced here, so they are not really "motivating examples". Could be rephrased as "and the example ontologies presented in Figure 1"
* c1 could be expressed as "c1 = (o1:Person,o2:Person,\equiv)"
* "c2 = (o1:priceInDollars,changeRate\times o2:priceInEuros,\equiv)" or even "(o1:priceInDollars,changeRate(o2:priceInEuros),\equiv)" the syntax of complex expressions completely depends on the alignment formalism used. The meaning of such expressions, and of correspondences in general, can also vary. In the case of this paper, all examples could be expressed in a DL-style syntax, with functions on concrete values.
* "c3 = (\exists o3:hasDecision.o3:Aceptance,o1:AcceptedPaper,\equiv)" using a DL-style syntax
* "c4 = (o1:writtenBy,o2:authorOf$^-$,\equiv)" using a DL-style syntax again
* "c5 = (\exists o2:accepted.{true},\exists o3:hasDecision.o3:Acceptance)"
* footnote 1: in a network of aligned ontologies, it may be assumed that local axioms (e.g., the domain assertion in ontology o3) only express truth in the context of their local ontology. According to this view, the consequences of the domain assertion in o3 do not necessarily carry over the alignments. So the choice whether to specify o3:Paper explicitly is not insignificant. This is one reason why using FOL may not be appropriate, as it suggests too much what should be the semantics of alignments. Alignments could be interpreted as DDL bridge rules instead, for example.
* "In opposition to" -> As opposed to
* I don't think Definition 3 is required (see remark above)
* "The approaches for generating such a kind of alignment are discussed in the next section" -> at this point, "the next section" can be interpreted as Section 2.4, which does not discuss the approaches.
- Sec.2.4:
* "both the two variants" -> "both variants"
* "Protege" -> "Protégé"
* "according to the entities evolved" -> ... entities involved
* "to multiple simple changes''[33]" -> space before ref
* "on the link between both tasks" -> "between the two tasks"

Sec.3:
This section is new and hasn't been reviewed before. It should provide examples of how some complex correspondences can be expressed using the various options.
- Sec.3.1:
* "Here are presented" -> Here we present
* "Many usual alignment languages ... (DDL, DFOL)" -> DDL and DFOL are not alignment languages. DDL/DFOL bridge rules may be used as ways of writing and interpreting correspondences.
* Logic syntaxes: I don't understand why this category only mentions Web-PDDL, Datalog and RIF, and neither FOL, nor OWL nor SWRL. Strangely, OWL and SWRL are in the category "vocabularies" (which they are not). DDL / DFOL and other "distributed" or "contextual" logics have constructs (like bridge rules) that can be specifically used for expressing correspondences. However, such logics offer a formal semantics, so that they are ways of interpreting the correspondences as much as ways of writing them. For instance, interpreting a correspondence as a DDL bridge rule does not yield the same entailments as interpreting it as a DL axiom.
* Vocabularies: assuming OWL and SWRL are excluded from this group and put with logic syntaxes, then the rest is entirely about dedicated ontology alignment formats/languages/vocabularies. The title of the subsection should reflect this notion. This part should mention the Alignment format from Inria's Alignment API. This format describes alignments globally, as well as correspondences, and allows one to embed any expression language for complex alignments. EDOAL is one such expression language that is supported by the Alignment API.
* "R2RML DB to Ontology, Logic, Transformation" -> it's a language for expressing mappings from Relational DBs to RDF, no more, no less.
* I would have expected in this section a description of the output formats of complex alignment tools. While I'm sure many use the Alignment format of the Alignment API, they may not always rely on a generic alignment language for the representation of member expressions.

Sec.4:
- Sec.4.2:
* "However, some approaches are able to generated" -> to generate
* In the description of the tree-based approach, the example from genetic programming has been removed, and the genetic programming approaches are no longer classified as tree-based. Yet, the definition has not changed, in spite of what the authors say in their response to Reviewer 1.
* in the definition of "fixed to fixed": "As shown in 3" -> "As shown in Figure 3"?
* It is strange that Fig.2 no longer has the member expression categories, especially since correlations are now shown and the correlation between "fixed to fixed" and "atomic pattern" is mentioned in the text. As far as I can see, this has not been explained in the response to reviewers.

Sec.5:
- In spite of my previous remark on the vagueness of the "types of knowledge representation models", this section is still relying on non-informative descriptions of the model. This is a clear limitation of the survey. Mentioning that a complex ontology matching approach aligns "ontology to ontology" is not useful at all, unless "ontology" has a restricted meaning that's clearly defined (which is not the case in this paper). The surveyors should extract from the surveyed paper the precise knowledge representation formalism that the approach relies on. E.g., it would be much more helpful to say that Ritze et al. [68,69] work on DL ontologies. Then it is not necessary to check the papers to know that the approach can be used on OWL, but not on arbitrary FOL ontologies.
- Interestingly, [69] provides example correspondences that are exactly the same as those used by the surveyors, yet they are represented by triples with true standalone member expressions, using DL-style syntax. It is puzzling that the authors have not simply followed those notations and been in line with their definition of correspondence.
- Sec.5.1:
* Example 1 used to have "\sqsubseteq" as a connector, which was replaced with \leq. Note that \leq is not more conventional than \sqsubseteq as a notation for implication, and instead suggests that the p(x,y) may denote numbers, or elements of an ordered set. Again, the example comes from [69] where DL-style syntax is used to express member expressions.
* Ref [22], which was previously considered out of scope, is now in this section. Svab-Zamazal and Svatek, who used to be in this category, have moved to another category. Other references have changed category since the previous version, not all being mentioned in the response to reviewers. This reshuffling of references across categories brings suspicion about the accuracy of the classification. Either the criteria for classification are not crisp enough, or the approaches were not carefully reviewed.
* The correspondence in Ex.3 cannot be represented directly with DL member expressions, but with a rolification of BirthEvent, it can "(o1:birthDate,frame:hasSubject^-\circ rolify(BirthEvent)\circ frame:hasDate,\equiv)" where rolify(BirthEvent) such that $BirthEvent \equiv \exists rolify(BirthEvent).Self$ and $Func(rolify(BirthEvent))$
* in the description of Bayes-ReCCE, there is again \leq (this time used as a replacement of \sqsupseteq, in contrast to Ex.1)
* in the description of KAOM, "coeff" is written differently the first and second times it appears
* BootOX is said to match "database schema to ontology". Again, it is specifically for mapping *relational* DBs to *OWL* ontologies.
- Sec.5.2:
* "in the ontologies by with the help of"
* The example given at the end of Svab-Zamazal and Svatek is absurd, as stated. It should read $\forall x,z1 (\exists y,z2 giveReview(x,y)\wedge reviewOfPaper(y,z1)\wedge reviewAppreciation(y,z2))\to reviewsPaper(x,z1)$ and the member expression for o1 could be expressed as $giveReview\circ rolify(\exists reviewAppreciation)\circ reviewOfPaper$, for instance
* There are still many examples in the text that have the form of generic patterns. It makes it more difficult to distinguish between symbols for variables and symbols for constants, and it is hard to recognise what are specific examples and what are patterns of correspondences.
* Xu and Embley does not work on every schema model. It assumes that source and target schemas are given as conceptual-model graphs (not tabular, etc.). Then there are other approaches following this where the "knowledge model" is identified as "database schema" without specifying which type of database (most often, those approaches work on relational databases)
- Sec.5.3:
* An et al. uses correspondences of the form u \equiv s instead of FOL formulas, in spite of the authors having argued that FOL would be used, and arguing in the response to reviews that notations have been harmonised.
* Ex.10: "o1:Person is the domain o1:email" -> domain of (same next line)
- Sec.5.4:
* Ex.12 and 13 use yet another notation for correspondences ("using the authors' format"). Why should these particular cases use the authors' format when no other cases do?
* KARMA was added to this section, but also to Sec.5.5, with very similar, redundant text.
- Sec.5.5:
* Nunes et al. and de Carvalho et al. were moved from Sec.5.4 to this section, although they mention trees in their description
* there is a broken formula in Ex.14: "\forall x,y, \to o2:reviewerOf(x,y)"
* the formula at the end of Ex.14 does not make sense: it says that anything that writes a review is a reviewer of everything
- Sec.5.6
* "summarized" -> "summarised" to be consistent with the rest of the paper

Sec.6:
- Sec.6.3:
* there are missing "x"s after the first two \foralls
* "pattern-based ontology formats" -> what's this? "such as EDOAL" -> to say that EDOAL is an ontology format is quite arguable.
* "occurence restriction" -> don't you mean "existential restriction"?
* "a type restriction" -> what's this?

Sec.7:
- "The notion of contextual alignment [120]" -> the reference to 120 is misleading as it does not mention the notion of "contextual alignment"

References:
- [9] where has this been published?
- [29] "Volkel" -> "Völkel". "rdf" -> "RDF"
- [30] "et al." -> there is no other author. There is "2002" in bold face (probably this has been used as volume or number in BibTeX)
- [39] "et al." -> there is no other author. This reference has number 10 in bold face, followed by "(10)". Something must be incorrect in the BibTeX metadata.
- [40] "et al." -> idem
- [41] -> the reference should indicate what type of publication it is. It's a project deliverable (Knowledge Web).
- [42] -> put the name of the workshop in lower case. There is a missing closing parenthesis
- [47] -> this is a W3C recommendation
- [48] -> remove extra dot at the end of title
- [50] -> it is certainly not a Citeseer technical report
- [52] -> missing capital letters. "et al." -> there is no other author. A reference to the current XSLT standard (XSLT 3.0) would be better
- [53] -> capital "P" for Protégé
- [55] -> delete dot at end of title. This was published at SIGMOD, not "ResearchGate"
- [57] -> there is an additional "1" at the end of the title
- [64] -> delete dot at end of title
- [67] -> idem + where published?
- [117] delete dot in title