Review Comment:
Summary
The paper addresses the important problem of entity resolution in situations where records have a significant number of missing attributes. Traditional methods often treat missing values as non-matches, assigning them a similarity of zero.
The authors propose two ideas. The first is to generate a family of rules tuned to subsets of attributes, thus avoiding the issue of missing values within each rule. The second is to scale the weights assigned to each feature (a comparison between attributes) so that stronger similarity between the available attributes is required before a pair is considered a match.
The first idea improves recall, as the matches from each rule are unioned; the second improves precision, as it requires stronger similarity when values are missing. The authors then propose a method to combine both ideas to achieve a better F-score. The evaluations support the contribution of each idea.
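To make the second idea concrete, here is a minimal sketch of scaled aggregation in the spirit described above (the function name, the coverage-based scaling, and the exact role of beta are my own illustration, not the paper's actual formula):

```python
def scaled_score(similarities, weights, beta=0.5):
    """Aggregate attribute similarities, skipping missing values (None),
    then scale the result down when attributes are missing, so that pairs
    with fewer available attributes need stronger agreement to match.
    The coverage ** beta scaling is illustrative, not the paper's formula."""
    available = [(s, w) for s, w in zip(similarities, weights) if s is not None]
    if not available:
        return 0.0
    base = sum(s * w for s, w in available) / sum(w for _, w in available)
    coverage = len(available) / len(similarities)  # fraction of non-missing attributes
    return base * coverage ** beta

# A pair with all four attributes present scores higher than a pair with
# comparable agreement on only two available attributes:
full = scaled_score([0.9, 0.8, 0.85, 0.9], [1, 1, 1, 1])
sparse = scaled_score([0.9, None, 0.85, None], [1, 1, 1, 1])
assert sparse < full
```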
The authors present their ideas in the context of their previous work on GenLink. While this is convenient, and it is how the algorithms were implemented, the ideas are independent of the previous work, and their contribution would be stronger if they were cast as general ideas, with GenLink presented as one implementation of them.
The paper is interesting and should be published after minor revisions.
(1) originality
The idea of searching for a family of rules is novel and interesting. The idea of adjusting the weights of the aggregation is interesting but simple, and not fully explored (e.g., the influence of, and sensitivity to, the beta parameter). The idea of combining the two approaches is original and interesting.
Overall, originality is good.
>> blocking: there is no mention of blocking, and it deserves at least some discussion, as blocking is necessary for real-world datasets and may itself be affected by sparsity.
(2) significance of the results
>> influence of hyper-parameters:
hyper-parameters: c in rulecount and beta in selective aggregation.
How were the hyper-parameters tuned? Were they tuned for each dataset, or were the same settings used for all datasets?
What is the influence of the hyper-parameters on the results?
>> learning times: there is no discussion of the learning times of the algorithms. This is important, as genetic algorithms can be very slow.
>> hand-written rules: the paper includes only a vague reference to the rules having been written by an expert. This is too vague, as a determined user can write very good rules; in fact, a determined user can write decision-tree-style rules to deal with sparse values. Was this done? It would be good to list the hand-written rule for one of the datasets.
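To illustrate the point about hand-written rules: a determined user might write something of roughly this shape, branching on which attributes are present (attribute names, similarity function, and thresholds are all hypothetical):

```python
def token_jaccard(x, y):
    """Toy token-level Jaccard similarity, used only for illustration."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0

def is_match(a, b, sim):
    """Hand-written, decision-tree-style matching rule: each branch only
    uses attributes that are present on both records."""
    if a.get("isbn") and b.get("isbn"):
        return a["isbn"] == b["isbn"]
    if a.get("title") and b.get("title"):
        if sim(a["title"], b["title"]) < 0.8:
            return False
        # Use the author only when it is available on both records.
        if a.get("author") and b.get("author"):
            return sim(a["author"], b["author"]) > 0.7
        # With no author, demand a stronger title match.
        return sim(a["title"], b["title"]) > 0.9
    return False
```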
>> configuration of baselines and other systems: for replicability, the configuration of the other systems should be discussed and perhaps presented. Was the default configuration used, or was there an attempt to optimize it, e.g., for FEBRL and EAGLE?
>> sparsification: random sparsification may unfairly hurt the machine learning (ML) approaches, as sparsity in real-world datasets is not random and ML could identify the pattern.
>> aside notes:
The paper mentions the importance of learning non-linear rules, yet, interestingly, the main examples are linear: the first examples in Fig. 2 and Fig. 3 are linear, and there is only one example using max.
>> comments on related work:
Machine learning has developed many techniques to deal with missing values. Examples include methods that tolerate missing values and missing-value imputation techniques that work well for numeric values and for categorical variables with few categories.
In traditional ML, a data scientist spends a significant amount of time developing features, selecting models, and tuning hyper-parameters. This process leads to dramatic improvements over the first model that users try (e.g., running an SVM). A particular focus in this process is the handling of missing values, through imputation or by defining additional features (e.g., a new indicator feature that specifies whether a value is present). While this process is not automated, a savvy data scientist can achieve excellent results in a few hours.
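As a concrete instance of the indicator-feature pattern mentioned above (a generic sketch, not tied to any system evaluated in the paper):

```python
def expand_features(record, fields):
    """Turn a record with possibly-missing fields into a dense feature dict:
    each field contributes its (imputed) value plus a presence indicator,
    so a downstream model can learn how to treat missingness."""
    out = {}
    for f in fields:
        v = record.get(f)
        out[f + "_present"] = 1 if v is not None else 0
        out[f] = v if v is not None else 0.0  # simple zero imputation
    return out

expand_features({"price": 9.99, "year": None}, ["price", "year"])
# -> {"price_present": 1, "price": 9.99, "year_present": 0, "year": 0.0}
```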
GenLinkSA uses a very simple formula (based on the beta parameter) to compute the score of the feature vector generated for a pair of records. It is likely that ML methods could learn a much better formula.
This possibility makes me question the significance of the contributions. It is true that the simple algorithm presented in this paper improves the state of the art, but it also suggests that more sophisticated methods could do better. As this type of tuning is common in ML, it should be mentioned in the paper.
(3) quality of writing
The paper is clearly written for the most part, but does contain a number of grammatical errors that should be cleaned up.
>> The following sections are not clearly written in the sense that the ideas are not expressed precisely.
The GenLink overview on page 2 is somewhat vague. The authors refer the reader to the GenLink paper, but the overview should be precise. The crossover paragraph is too short and imprecise; e.g., "selects one operator at random in a pair of linkage rules": does this mean one operator in each rule? Do the operators have to be of the same kind (I suppose so)? The next sentence talks about aggregation operators, so it is not clear whether crossover applies to all four types of operators. A small amount of work would make the crossover section clear.
Group application: is the sorting of the rules by coverage done statically, based on an analysis of the training data, or is it done for each pair being tested? The examples suggest the second option; please clarify.
>> Some parts are written precisely, but not explained fully.
Equation 3 is complex and is not explained.
No intuition is provided for Equation 4; a sentence would be enough.
Equations 6 and 7, although precise, seem overly complicated for expressing a simple idea. As they stand, these equations are hard to follow.