Learning Expressive Linkage Rules from Sparse Data

Tracking #: 1831-3044

This paper is currently under review
Petar Petrovski
Christian Bizer

Responsible editor: 
Jérôme Euzenat

Submission type: 
Full Paper
Identifying entities in different data sources that describe the same real-world object is a central challenge in the context of the Web of Data as well as in data integration in general. Due to the importance of the problem, there exists a large body of research on entity resolution in the Linked Data as well as in the database community. Interestingly, most of the existing research focuses on entity resolution on dense data, meaning data that does not contain too many missing values. This paper sets a different focus and explores learning expressive linkage rules from as well as applying these rules to sparse data, i.e. data exhibiting a large amount of missing values. Such data is a common challenge in the context of the Web as different websites describe entities at different levels of detail. We propose and compare three entity resolution methods that employ genetic programming to learn expressive linkage rules from sparse data. First, we introduce the GenLinkGL algorithm which learns groups of linkage rules and applies specific rules out of these groups depending on which values are missing from a pair of records. Next, we propose GenLinkSA, which employs selective aggregation operators within rules. These operators exclude misleading similarity scores (which result from missing values) from the aggregations, but on the other hand also penalize the uncertainty that results from missing values. Finally, we introduce GenLinkComb, a method which combines the central ideas of the previous two into one integrated method. We evaluate all methods using six benchmark datasets: three of them are e-commerce product datasets, the other datasets describe restaurants, movies, and drugs.We show improvements of up to 16% F-measure compared to handwritten rules, on average 12% F-measure improvement compared to the original GenLink algorithm, 15% compared to EAGLE, and 8% compared to FEBRL.
Full PDF Version: 
Under Review