Learning Expressive Linkage Rules from Sparse Data

Tracking #: 1964-3177

Petar Petrovski
Christian Bizer

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
A central problem in the context of the Web of Data, as well as in data integration in general is to identify entities in different data sources that describe the same real-world object. There exists a large body of research on entity resolution. Interestingly, most of the existing research focuses on entity resolution on dense data, meaning data that does not contain too many missing values. This paper sets a different focus and explores learning expressive linkage rules from as well as applying these rules to sparse web data, i.e. data exhibiting a large amount of missing values. Such data is a common challenge in various application domains including e-commerce, online hotel booking, or online recruiting. We propose and compare three entity resolution methods that employ genetic programming to learn expressive linkage rules from sparse data. First, we introduce the GenLinkGL algorithm which learns groups of matching rules and applies specific rules out of these groups depending on which values are missing from a pair of records. Next, we propose GenLinkSA, which employs selective aggregation operators within rules. These operators exclude misleading similarity scores (which result from missing values) from the aggregations, but on the other hand also penalize the uncertainty that results from missing values. Finally, we introduce GenLinkComb, a method which combines the central ideas of the previous two into one integrated method. We evaluate all methods using six benchmark datasets: three of them are e-commerce product datasets, the other datasets describe restaurants, movies, and drugs. We show improvements of up to 16% F-measure compared to handwritten rules, on average 12% F-measure improvement compared to the original GenLink algorithm, 15% compared to EAGLE, 8% compared to FEBRL, and 5% compared to CoSum-P.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 29/Jul/2018
Review Comment:

Addendum to original review: The authors addressed my comments well, and I see that they also addressed the detailed comments provided by the other reviewers. The detailed explanation of the changes was very helpful in assessing the improvements. I think the paper is now ready for publication.

Review #2
By Oktie Hassanzadeh submitted on 06/Oct/2018
Review Comment:

The authors have very carefully addressed all my comments and answered all my questions.

Review #3
Anonymous submitted on 24/Nov/2018
Minor Revision
Review Comment:

I thank the authors for the systematic approach chosen to answer the reviewers' comments. I especially appreciate the authors' efforts in running Zhu et al.'s algorithm. My previous comments as to the originality and significance of the results remain. The quality of writing has been improved but I would still suggest that the authors read through their paper carefully. Please find my comments below:

Answers to answers
- R1C2: The formulation is still incorrect and should read "The evaluation of GenLink has shown that the algorithm delivers good
results with F-measures above 95\% on different dense datasets such as sider-drugbank, LinkedMDB, restaurants [19].
- R1C5: I'm afraid the meaning of M is still not clearly described. The paper reads "the subset M consisting of all pairs" and does not state what M is a subset of. Do you simply mean set? Please fix.
- R1C6: The text now reads "a relation ~R". Do you mean owl:sameAs? If yes, please say so. If not, please define ~R formally.
- R1C14: "removing (or loosening) the penalty has the potential to result with overfitted model and thus would not improve results" Do you mean "removing (or loosening) the penalty has the potential to result in an overfitted model. Thus, it might not improve the results of our approach"? Please check.
- R1C15d: While I agree with the potential for improvement as to the runtimes and would like said fact to be added to the discussion, I'm afraid I disagree with the authors w.r.t. reporting runtime results. It is only fair to state runtimes (and the hardware on which the runtimes were achieved) so that other researchers know what to expect when aiming to use or extend your approach. Hence, I must repeat my request for a table of the runtimes one the different datasets being added to the paper. It'd be especially helpful if the runtimes of the other algorithms were added as well.
- R1C21: I would argue that seeing the manual rule would actually help the reader get an idea of the complexity of the problem at hand. Hence, I'd still suggest that it is added.
- R1C28: The authors seem to mix up the LIMES framework and the LIMES algorithm. The paper describing the framework is (Ngonga Ngomo, 2012) and I guess the authors mean the framework when they talk about LIMES.

- Equation 6. Do you mean $|\cup_{i=1}^|G| tp_i|$? You redefine i on the top of your sum, making your equation incorrect. Please check.
- Equation 7. See Eq. 6.
- How do you know when to switch to a larger value of $c$. Any insights?
- Again, you use * and \times to mean multiplication (see Eq. 5 and the fitness_group equation). This was already pointed out in my last review and correction was claimed. I would be thankful if the authors could check their manuscript anew for such inconsistencies.

- Table 4: Please add the variable name to the labels (e.g., \alpha, \beta)
- The authors train on 66% of the data and evaluate on 33%. Why do they not use the standard protocol of n-fold validations? In all other cases, it could be that their results are just an artifact of the slices chosen (which is unlikely given the significance of their results but should still be checked). Please clarify or switch to an n-fold cross-validation. Any reason for the 3-fold instead of 10-fold validation commonly used?
- Statistical test. The authors do not describe what they test exactly. Do you compare the average F-measures achieve over 10 runs or do you compare single runs?
- " Correspondingly, the GenLinkSA algorithm gives comparable results in F-measure compared to FEBRL, mostly due to the jump in precision". It seems to me that your approach is commonly better than FEBRL. Please check.

"Additionally, find" => Additionally, we find [Do you mean we compute. You don't really need to find U once you have M]
Please mind the punctuation after your equations.
Footnotes should be placed after punctuation marks.