Review Comment:
In this paper, the authors redefine, using description logics, three existing semantics of keys that have been proposed in the Semantic Web setting. These keys vary depending on how multivalued properties are taken into account, and on how semantic mappings between classes and properties are handled (i.e. either the mappings are considered in the linkage rule or they are handled in a rewriting step). The authors formally compare the properties of each semantics, introduce the definition of generalized hybrid keys, and propose new types of keys that vary depending on their validity in subsets of instance pairs (i.e. strong keys, weak keys or plain keys). One of the main contributions is that the authors show that link keys are more general than keys. More precisely, data interlinking with link keys cannot be reduced to data interlinking with keys and ontology alignments.
Strong points:
- The main definitions of keys have already been published in international conferences, but in this paper the authors clarify these notions, unify their formalization using description logics, state and prove their properties (w.r.t. equivalence, class and property subsumption, class intersection or union) and compare them theoretically.
- Significant extensions are proposed to adapt the use of keys to different scenarios and validity constraints.
- The paper is well structured and clearly written.
Weak points:
- The main definitions are illustrated by examples taken from a real dataset (INSEE).
However, there is no experimental evaluation showing that the number of correct links discovered in real datasets using link keys is higher than in a scenario where alignments and keys are exploited. It would be very convincing to show the impact of the theoretical differences described in the paper on at least two heterogeneous datasets.
- More precisely, the relevance of defining weak, strong and plain keys is not convincing. Strong and detailed arguments are needed. Here again, some experimental results (even simple ones) evaluating the impact of using these three types of keys would be very welcome to support some of the hypotheses and to justify the final claim that link keys may lead to better results.
Minor:
- It would be really informative to describe the key semantics that can be used in existing linkage platforms that are mentioned in the related work section such as SILK and LIMES.
- On p. 3, you say that “F-key are keys in relational databases”. But existing approaches have exploited such keys in data graphs. Furthermore, you have just explained in the same section that the problem in relational databases is different since properties are functional. I think that this sentence has to be reformulated.
- In Section 4, p. 6, it is said that “eq-keys require some sort of local closed world assumption”. Then, on p. 11, it is said that “local completeness is necessary for interlinking with eq-keys”. Since emptiness is not considered for eq-keys, you have to define what kind of local or partial completeness is needed and whether it is needed at the discovery step (validity).
- In Section 4.2, p. 9, give a simple counterexample showing that (9) is not fulfilled for eq-keys. I agree with the claim, but it is not obvious.
- Generalized hybrid keys are introduced but not used in the subsequent definitions and properties. However, many properties are valid for both key semantics, and therefore for generalized keys as well.
- p. 12: The sentence used to introduce plain link keys, i.e. "a set of property pairs is a plain key for a pair of classes if it is ... will be linked", is not clear.
- The fact that link keys can involve classes that are not semantically mapped (no subsumption or equivalence relations, only intersections) could be stated earlier.
- You have to introduce the weak, strong and plain keys by first illustrating the reasons why they will be efficient and meaningful in practice. I agree that strong keys are sometimes not numerous or not usable to interlink two datasets. However, if a name is not a key for a person in one dataset, why would it be correct to define and exploit it when the key is only valid on the subset of persons shared by the two datasets? Is it because of their coverage? How will you know that a weak or plain key is valid for new datasets (i.e. that the key condition will hold for the instances that will be linked)? Give intuitions, other examples or scenarios, and possibly indicate how dataset characteristics can be taken into account to decide.