Learning Expressive Linkage Rules from Sparse Data

Tracking #: 1831-3044

Authors: 
Petar Petrovski
Christian Bizer

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Abstract: 
Identifying entities in different data sources that describe the same real-world object is a central challenge in the context of the Web of Data as well as in data integration in general. Due to the importance of the problem, there exists a large body of research on entity resolution in the Linked Data as well as in the database community. Interestingly, most of the existing research focuses on entity resolution on dense data, meaning data that does not contain too many missing values. This paper sets a different focus and explores learning expressive linkage rules from, as well as applying these rules to, sparse data, i.e., data exhibiting a large number of missing values. Such data is a common challenge in the context of the Web, as different websites describe entities at different levels of detail. We propose and compare three entity resolution methods that employ genetic programming to learn expressive linkage rules from sparse data. First, we introduce the GenLinkGL algorithm, which learns groups of linkage rules and applies specific rules out of these groups depending on which values are missing from a pair of records. Next, we propose GenLinkSA, which employs selective aggregation operators within rules. These operators exclude misleading similarity scores (which result from missing values) from the aggregations, but on the other hand also penalize the uncertainty that results from missing values. Finally, we introduce GenLinkComb, a method which combines the central ideas of the previous two into one integrated method. We evaluate all methods using six benchmark datasets: three of them are e-commerce product datasets; the others describe restaurants, movies, and drugs. We show improvements of up to 16% F-measure compared to handwritten rules, on average 12% F-measure improvement compared to the original GenLink algorithm, 15% compared to EAGLE, and 8% compared to FEBRL.
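To make the selective-aggregation idea concrete, here is a minimal sketch of how such an operator might behave (illustrative only; the function name, the beta parameter, and the exact penalty form are assumptions, not the paper's formula):

def selective_average(sims, beta=0.1):
    """Average the similarity scores of one record pair, skipping
    comparisons whose value is missing (None) and penalizing the
    result for the uncertainty those gaps introduce."""
    present = [s for s in sims if s is not None]
    if not present:
        return 0.0  # nothing to compare on
    missing_fraction = 1 - len(present) / len(sims)
    # beta controls how strongly missing values lower the score
    return (sum(present) / len(present)) * (1 - beta * missing_fraction)

# Example: name and brand match well, gtin is missing
print(selective_average([0.9, 0.8, None]))  # ~0.82 instead of a hard 0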
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 15/May/2018
Suggestion:
Major Revision
Review Comment:

Summary: The authors present an approach for the pertinent problem of linking RDF data with missing entries. The problem has been considered, among others, in (Zhu et al., 2016; see abstract). The approach presented in the paper is a relatively straightforward extension of GenLink. Still, the results can be pertinent for real-world applications. The paper demands a significant amount of work (especially w.r.t. formal details and evaluation) to be worthy of publication. The quality of writing is fine. My detailed comments and requests to the authors are below.

Abstract: The authors address entity resolution on sparse data, i.e., on data with "a large amount of missing values". The authors propose two core algorithms and a combination thereof to learn expressive linkage rules for entity resolution. Their results suggest that they outperform the state of the art they considered.

Introduction: The authors motivate their work with applications from e-shops, where product descriptions are incomplete and vary across shops. This is indeed an important restriction when considering real data. This restriction has also been considered in (Zhu et al., 2016) ([39] in the paper), where the authors use a graph summarization approach to address the problem of partly missing values for link discovery. The authors then go on to present an appraisal of GenLink, pointing to its top performance of above 95% F-measure. I do wonder whether this is necessary, but if the authors want to keep this claim, they should point out on which datasets this performance is achieved, as their performance claims contradict some of their later results.

Problem statement
- Notation: e_{a, b} = {(p_i, v_i)} is actually incorrect. The model should read "Each e \in A \cup B can be represented as a set of attribute-value pairs ...". Please correct.
- Notation: e_{a, b}: You run different indexes on p and v in your definition (e.g., p_1, v_2). Hence, (p_n, v_n) cannot be correct. Please fix.
- "Find the subset M": What is M is subset of? Do you mean the set of pairs? Please specify.
- Pairs of entities for which equality holds: the resources e_a and e_b are not equal. owl:sameAs is an equivalence relation but not the equality relation. Please define clearly what you mean by "equal" (feel free to override the meaning thereof, but please make sure it is formally sound).
- Find its complement U. Isn't finding M equivalent to finding U as M = (A \times B) \setminus U?
- Definition of R+: The two resources e_a and e_b are not equal. Please fix.
- Definition of R+: Formally, the definition is unsound, as there is no guarantee that you can learn an ML model which generates M and U such that R+ \subseteq M and R- \subseteq U. Please fix.
- Equation (3): You use * and \times to denote the cross product. Please stick to one (preferably \times).
- Linkage rule format: While the linkage rule description is appreciated, it still leaves space for interpretation. Please present a formal grammar for the linkage rules you use to ensure that the reader can reproduce your implementation if need be.
- The authors also claim that their linkage rule approach is "rather expressive" when compared to other grammars. What does this mean exactly? Can the author provide a formal interpretation of the statement?
- Equation (5): Again, please stick to \times or *.

Approaches
- The authors claim that GenLink could perform well if one removed the penalty for long specifications. Please add this extension of the original algorithm to your evaluation.
- GenLinkGL
* Group definition: The authors select the fittest individual. I assume they use the same initialization as in GenLink; please confirm.
* Group fitness: Definition unclear. Please write an equation to clarify. Do you mean you compute the union of the outputs of the groups and measure the fitness thereupon?
* Constant c: How did you establish the relation between c and the population?
* Rule selection: The rules for matching are determined per pair. This means that all pairs from A \times B have to be compared. This is impractical for large datasets. Please report the runtimes of your algorithm in the experiments.
- GenLinkSA
* The formal specification of the GenLink aggregations is unclear. The sets S*, N* and F^a are not defined. The meaning of the arrow notation under the sets is also unclear. Please rewrite or specify.

Experiments
- All results were averaged: Please provide standard deviation values. Moreover, given that you use genetic programming, please run statistical tests to state whether your average behaviour is significantly different from that of the other algorithms. I would suggest a rank test, given that the experiments are run 10 times (see the sketch after this list).
- Interestingly, approaches such as (Nikolov et al., 2012) suggest that they need no training data to perform well on many of these datasets. So does (Zhu et al., 2016). Please compare with these approaches. Moreover, please consider the datasets in (Zhu et al., 2016).
- Why did you not consider removing the name property?
- How were the handwritten rules produced?
- The authors compare with EAGLE, FEBRL and MARLIN. However, these algorithms have already been shown to be lacking in previous works. Please compare with more state-of-the-art approaches, e.g., Zhu et al. (2016) and Nikolov et al. (2012).
- The authors claim that EAGLE has *significantly* lower results. How was significance measured?
- By a significant margin => Significance test?
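For instance, a Wilcoxon signed-rank test over the paired per-run F-measures would do (sketch with made-up placeholder numbers, not the paper's results):

from scipy.stats import wilcoxon

# Placeholder F-measures from the 10 runs of two systems on one dataset
f1_new      = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.93, 0.90, 0.91, 0.92]
f1_baseline = [0.80, 0.82, 0.79, 0.81, 0.80, 0.78, 0.83, 0.80, 0.81, 0.79]

stat, p = wilcoxon(f1_new, f1_baseline)
print(p < 0.05)  # is the difference significant at the 5% level?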

State of the art
I commend the authors for having already compared with EAGLE, FEBRL and GenLink. However, these approaches are now more than 5 years old and are known to be outperformed by other algorithms. In particular,
- (Zhu et al., 2016) present an unsupervised approach for learning on RDF data. Please also consider the dataset they use.
- (Nikolov et al., 2012) present an unsupervised genetic learning approach.

References
- "Can be unsupervised": One of the first works in the Semantic Web on this topic was by Nikolov et al., 2012. Please add it. Moreover, please add a reference to (Zhu et al., 2016) (i.e., [39]) here as it achieves a remarkable performance.
- Some references are unclear to me, e.g., Supervised [29]. Ngonga Ngomo et al., 2011 does not consider machine learning. This reference seems misplaced.
- LIMES [29]. The LIMES framework is described in (Ngonga Ngomo, 2012). Please replace the reference.
- Combination of the two [17]. Can you please explain how [17] does not fall into supervised approaches?
- How does one combine supervised and unsupervised approaches? The two paradigms are mutually exclusive.
- Please cover active learning in the state of the art.

Typos
- in the Linked Data => in the Linked Data Web?
- same real world object => same real-world object
- which is iteratively evolved => which is evolved iteratively
- gtin number => gtin
- footnotes after punctuation
- while still be able => while still being able
- low density properties => low-density properties
- top fitness individual => fittest individual
- thee => three
- Beside of comparing => In addition to comparing
- systems[5, 9] => systems [5, 9]
- there have been identified 112 duplicate records => 112 duplicate records were identified.
- main difficulty => main weakness
- "pivoting around" => rephrase, algorithms are not mobile
- MRLIN => MARLIN

Papers
- Linhong Zhu, Majid Ghasemi-Gol, Pedro Szekely, Aram Galstyan and Craig A. Knoblock (2016). Unsupervised Entity Resolution on Multi-type Graphs. ISWC 2016 (Best Paper Award)
- Nikolov, Andriy; d'Aquin, Mathieu and Motta, Enrico (2012). Unsupervised learning of link discovery configuration. ESWC 2012
- Ngonga Ngomo, Axel-Cyrille (2012). On Link Discovery using a Hybrid Approach. Journal on Data Semantics 1(4):203–217

Review #2
Anonymous submitted on 20/May/2018
Suggestion:
Minor Revision
Review Comment:

Summary

The paper addresses the important problem of entity resolution in situations where records have significant numbers of missing attributes. Traditional methods often treat missing values as non-matches, assigning a similarity of zero.

The authors propose two ideas. One idea is to generate a family of rules tuned to subsets of attributes, thus avoiding the issue of missing values in each rule. The second idea is to scale the weights assigned to each feature (a comparison between attributes) so that stronger similarity between the available attributes is required to consider a pair a match.

The first idea produces better recall, as the matches from each rule are unioned; the second idea produces better precision, as it requires stronger similarity when values are missing. The authors then propose a method to combine both ideas to achieve a better F-score. The evaluations support the contribution of each of the ideas.
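A minimal sketch of the first idea as I understand it (names and the coverage check are illustrative, not the paper's API):

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    required: List[str]                   # attributes the rule compares
    score: Callable[[Dict, Dict], float]  # similarity in [0, 1]

def match_with_group(a: Dict, b: Dict, group: List[Rule]) -> float:
    """Apply the first rule whose required attributes are present in
    both records; the group is assumed sorted so that rules covering
    more attributes are tried first."""
    for rule in group:
        if all(k in a and k in b for k in rule.required):
            return rule.score(a, b)
    return 0.0  # no applicable rule for this pair

# A rule using name+brand, plus a fallback using name only
group = [
    Rule(["name", "brand"],
         lambda a, b: 0.5 * (a["name"] == b["name"])
                    + 0.5 * (a["brand"] == b["brand"])),
    Rule(["name"], lambda a, b: float(a["name"] == b["name"])),
]
print(match_with_group({"name": "X1"}, {"name": "X1"}, group))  # 1.0 via the fallback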

The authors present their ideas in the context of their previous work on GenLink. While this is convenient, and it is how the algorithms were implemented, the ideas are independent of the previous work, and their contribution would be stronger if they were cast as general ideas. The GenLink implementation is then one implementation of the general ideas.

The paper is interesting and should be published after minor revisions.

(1) originality

The idea of searching for a family of rules is novel and interesting. The idea of adjusting the weights of the aggregation is interesting but simple, and not fully explored (e.g., the influence of and sensitivity to the beta parameter). The idea of combining the two approaches is original and interesting.

Overall, originality is good.

>> blocking: there is no mention of blocking, and it deserves at least some discussion, as blocking is necessary for real-world datasets and may itself be affected by sparsity (see the sketch below).
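To illustrate the sparsity concern: a standard blocking key silently drops every record that lacks the key attribute (hypothetical sketch):

from collections import defaultdict

def block_by(records, key):
    """Group records by a simple prefix blocking key; records missing
    the key attribute end up in no block and are never compared."""
    blocks = defaultdict(list)
    for r in records:
        if key in r:            # sparse records fall through here
            blocks[r[key][:3].lower()].append(r)
    return blocks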

(2) significance of the results

>> influence of hyper-parameters:

hyper-parameters: c in rulecount, beta in selective aggregation.

How were the hyper-parameters tuned? Were they tuned for each dataset, or were the same settings used for all datasets?

What is the influence of the hyper-parameters on the results?

>> learning times: there is no discussion of the learning times of the algorithms. This is important, as genetic algorithms are very slow.

>> hand-written rules: the paper includes only a vague statement that the rules were written by an expert. This is too vague, as a determined user can write very good rules. In fact, a determined user can write decision-tree-like rules to deal with sparse values. Was this done? It would be good to list the handwritten rule for one of the datasets.

>> configuration of baselines and other systems: for replicability, the configuration of the other systems should be discussed and perhaps presented. Was the default configuration used, or was there an attempt to optimize it, e.g., for FEBRL and EAGLE?

>> sparsification: random sparsification may unfairly hurt the machine learning (ML) approaches, as sparsity in real-world datasets is not random and ML could identify it.

>> aside notes:

The paper mentions the importance of learning non-linear rules, yet interestingly the main examples are linear (the first example in Fig. 2 and the example in Fig. 3); there is only one example using max.

>> comments on related work:

Machine learning has developed many techniques to deal with missing values. Examples include methods that tolerate missing values, and missing-value imputation techniques that work well for numeric values and for categorical variables with few categories.

In traditional ML, a data scientist spends a significant amount of time developing features, selecting models and tuning hyper-parameters. The process leads to dramatic improvements over the first model that users try (e.g., running an SVM). A particular focus in this process is the handling of missing values, through imputation or by defining additional features (e.g., a new indicator feature that specifies whether a value is present or not). While this process is not automated, a savvy data scientist can achieve excellent results in a few hours.
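As a concrete example of the last point, a minimal pandas sketch (column names are made up):

import pandas as pd

df = pd.DataFrame({"price": [9.99, None, 4.50],
                   "brand": ["acme", None, "acme"]})

# Indicator features: record whether a value was present at all
df["price_missing"] = df["price"].isna().astype(int)
df["brand_missing"] = df["brand"].isna().astype(int)

# Simple imputation: median for numeric, mode for categorical
df["price"] = df["price"].fillna(df["price"].median())
df["brand"] = df["brand"].fillna(df["brand"].mode()[0])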

GenLinkSA uses a very simple formula to compute the score of a feature vector generated for a pair of records (using the beta parameter). It is likely that ML methods could learn a much better formula.
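For instance, one could learn the weighting instead of fixing it: encode each pair as its similarity scores, with missing comparisons set to 0 plus one missingness indicator per comparison, and fit a standard classifier (sketch with placeholder data, not the paper's setup):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: sim(name), sim(brand), sim(gtin), then one missingness
# indicator per comparison; rows are placeholder record pairs.
X = np.array([
    [0.90, 0.80, 0.00, 0, 0, 1],   # gtin missing, true match
    [0.40, 0.00, 0.30, 0, 1, 0],   # brand missing, non-match
    [0.95, 0.90, 0.88, 0, 0, 0],   # dense pair, true match
    [0.20, 0.10, 0.00, 0, 0, 1],   # gtin missing, non-match
])
y = np.array([1, 0, 1, 0])

# The learned coefficients play the role of the hand-set beta penalty
clf = LogisticRegression().fit(X, y)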

This possibility makes me think about the significance of the contributions. It is true that the simple algorithm presented in this paper improves the state of the art, but it also suggests that more sophisticated methods could do better. As this type of tuning is common in ML, it should be mentioned in the paper.

(3) quality of writing

The paper is clearly written for the most part, but does contain a number of grammatical errors that should be cleaned up.

>> The following sections are not clearly written in the sense that the ideas are not expressed precisely.

The GenLink overview on page 2 is somewhat vague. The authors refer the reader to the GenLink paper, but the overview should be precise. The crossover paragraph is too short and imprecise, e.g., "selects one operator at random in a pair of linkage rules" ... does this mean one operator in each rule? Do the two operators have to be of the same kind (I suppose so)? The next sentence talks about aggregation operators, so it is not clear whether crossover applies to all 4 types of operators. A small amount of work can make the crossover section clear.
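For reference, generic GP subtree crossover looks like the sketch below (rules encoded as nested lists [operator, child, ...]; this is the textbook operator, not necessarily GenLink's exact one):

import copy
import random

def subtree_paths(tree, path=()):
    """Yield the path of every node; a node is a list of the form
    [operator, child, child, ...], leaves are plain values."""
    yield path
    for i, child in enumerate(tree[1:], start=1):
        if isinstance(child, list):
            yield from subtree_paths(child, path + (i,))

def node_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def subtree_crossover(a, b):
    """Pick one random non-root node in each parent and swap the
    subtrees rooted there, yielding two offspring."""
    a, b = copy.deepcopy(a), copy.deepcopy(b)
    choices_a = [p for p in subtree_paths(a) if p]
    choices_b = [p for p in subtree_paths(b) if p]
    if not choices_a or not choices_b:
        return a, b                  # trees too small to cross over
    pa, pb = random.choice(choices_a), random.choice(choices_b)
    node_at(a, pa[:-1])[pa[-1]], node_at(b, pb[:-1])[pb[-1]] = \
        node_at(b, pb), node_at(a, pa)
    return a, b

# e.g. subtree_crossover(["avg", ["levenshtein", "name"], ["jaccard", "brand"]],
#                        ["max", ["equality", "gtin"], ["jaccard", "title"]])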

Group application: is the sorting of the rules done statically, based on an analysis of the training data, or is the sorting by coverage done for each pair being tested? The example suggests the second option; please clarify.

>> Some parts are written precisely, but not explained fully.

Equation 3 is complex and not explained.

The intuition behind Equation 4 is not provided; a sentence would be enough.

Equations 6 and 7, although precise, seem overly complicated for expressing a simple idea. As they stand, these equations are hard to follow.

Review #3
By Oktie Hassanzadeh submitted on 21/May/2018
Suggestion:
Accept
Review Comment:

The paper presents a solution to the problem of matching entities across data sources when entities have a diverse and "sparse" set of attributes, so that classic entity matching solutions fail. Such a diverse set of attributes is very common in practice, particularly when matching entities across Web data sources. So although the solution does not rely on an ontology and is not a purely "semantic web" solution, it is directly applicable to matching (to/between) knowledge bases across the Web. The solution is implemented within the Silk framework, which is already widely used in the semantic web community.

Comments on the three dimensions for research contributions:
(1) originality: this might be the weakest aspect of the work, given that the algorithms are extensions of the authors' previous work. However, in my view, the quality of the solution, its impact, and the thorough experiments with very promising results make this an original contribution to this field.
(2) significance of the results: The results are significant and have the potential to make a major impact on the state of the art in entity matching.
(3) quality of writing: The paper is very well written. It presents a hypothesis and evaluates it very well. I found it relatively easy to understand and follow, in part thanks to the good and real examples used throughout the paper. It is also positioned reasonably well in the literature.

Given the above, I find the paper acceptable as is (except for the minor clarification needed; see the first issue below), but I strongly recommend that the authors respond to or address the following comments on the paper:

- First (this one needs to be fixed prior to acceptance): the results in Table 6 show that you use "gtin", but it is not listed as one of the properties in Table 2.

- The only major flaw I see in the evaluation, which is related to the above issue, is the lack of proper "ID" columns/properties in the datasets. Do you believe this is common in practice? You need to clarify the issue with "gtin", but were you able to retrieve any identifiers for the datasets? Perhaps they were used to build the ground truth?

- If possible, provide more details on how the ground truth labels are derived.

- Another issue I can see in the evaluation is that you do not have a baseline designed for the scenario you are addressing. You have baselines (TF-IDF or paragraph2vec), but these are the solutions one would use for dense data, not sparse data, as with all the other related work you compare with. One simple baseline you could try is the idea described in Section 5.6 of the following paper:
Hassanzadeh et al. Discovering Linkage Points over Web Data. PVLDB 6(6): 444-456 (2013)
http://www.vldb.org/pvldb/vol6/p445-hassanzadeh.pdf
Basically, apply the linkage rules sequentially in the order of their expected "quality" ("strength"/coverage); see the sketch after this comment.
Could you also have a simpler adaptation of the original GenLink algorithm as a baseline? One could also imagine a decision-tree-based approach.
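A sketch of that sequential baseline (rules assumed to be callables returning a boolean, ordered by expected precision; illustrative only):

def cascade_match(pairs, rules):
    """Apply rules in descending order of expected quality; a pair
    matched by a stronger rule is removed before any weaker rule
    can see it."""
    matches, remaining = [], list(pairs)
    for rule in rules:               # strongest rule first
        still_unmatched = []
        for a, b in remaining:
            (matches if rule(a, b) else still_unmatched).append((a, b))
        remaining = still_unmatched
    return matches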

- In the description of Algorithm 2, you say that Algorithm 2 takes the input of Algorithm 2 (you mean Algorithm 1?).

- Can you specify the details of the handwritten rules mentioned on page 11? Can you show examples or link to supplementary material?

- A strong aspect of this paper is making the source code available, but the code has zero documentation. This is hardly better than not making any code available, since someone not familiar with your code can hardly run it. I strongly encourage you to write, at the very least, a simple "getting started" guide that allows me to run, for example, your running example from the paper. This makes your paper truly repeatable and allows me to apply it in my similar application scenarios (without having to contact you for further details!).

- I strongly recommend that you take part in the annual OAEI campaign at the Ontology Matching workshop. Your solution is very applicable to matching instances across heterogeneous ontologies, and your involvement in the ontology matching community could further improve the state of the art.