Review Comment:
The authors introduce a framework for correcting assertions in knowledge bases. Various components can be plugged in both for generating as well as for scoring candidate replacement statements. The approach is evaluated on three different datasets.
I have two major points of critique on the paper, which I would like to see addressed in a revised version.
The first point concerns the selection of components for candidate generation and candidate scoring. While the overall approach works, the selection of components and their assignment to the two stages could also have been made differently. For example, the authors generate candidates by finding entities with similar names and then score them based on connecting paths; this could equally have been done the other way around, i.e., considering all entities in a two-hop neighborhood as candidates and then scoring them by syntactic similarity. The same holds for most of the other components assigned to either of the two stages. Here, I would like to ask the authors either to provide more evidence (e.g., by computing the recall curves in Fig. 3 for the other approaches as well, and showing that the ones chosen for the candidate generation stage indeed have the highest recall@k), or to give a clearer argument for why the components were assigned to the stages in the given way, e.g., based on a thorough analysis of error classes and their frequencies.
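To illustrate what I mean by the first option, the recall@k comparison could be computed roughly as follows (a sketch with made-up function names, not the authors' implementation):

    # Sketch: compare recall@k of different candidate generation strategies.
    def recall_at_k(generate, targets, gt_entity, k=30):
        """generate: maps a target assertion to a ranked candidate list;
        gt_entity: maps a target assertion to its correct entity."""
        hits = sum(1 for t in targets if gt_entity[t] in generate(t)[:k])
        return hits / len(targets)

    # e.g., compare lexical matching against a two-hop-neighborhood generator
    # (both generator names are hypothetical):
    # for k in (1, 5, 10, 30, 50):
    #     print(k, recall_at_k(lexical_candidates, targets, gt, k),
    #              recall_at_k(two_hop_candidates, targets, gt, k))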
The second point concerns the evaluation. In my opinion, the chosen evaluation metric is too optimistic. The authors compute performance (correction rate, empty rate, accuracy) only on the subset of errors for which a suitable replacement candidate for the literal exists. However, when the approach is applied to a real-world knowledge graph, the error rate on the total set of statements is what matters. For example, the DBP-Lit dataset contains 725 target assertions, 499 of which have a GT entity. For the remaining 226 it is best not to replace the literal at all (i.e., a low empty rate is desired on the literals with a GT entity, but a high empty rate on the rest).
The way the authors report accuracy overestimates the results. For example, an approach that always replaces a literal at an accuracy of 80% would also replace all of the 226 literals without a GT entity erroneously. In that case, the error rate would be (20%*499 + 226)/725 = 44.9%, which corresponds to an accuracy of only 55.1%, not 80%. Hence, I would like to propose a fairer evaluation that also takes into account the statements without a GT entity.
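To make the proposed computation explicit (using the DBP-Lit numbers above and a hypothetical always-replacing approach with 80% accuracy on the literals that do have a GT entity):

    # Sketch of the combined metric over all target assertions.
    with_gt, without_gt = 499, 226            # DBP-Lit split reported in the paper
    acc_on_gt = 0.80                          # hypothetical accuracy on literals with a GT entity
    errors = (1 - acc_on_gt) * with_gt + without_gt       # all 226 replacements are erroneous
    overall_error_rate = errors / (with_gt + without_gt)  # ~0.449
    overall_accuracy = 1 - overall_error_rate             # ~0.551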
Further comments and questions:
Section 4.2.1: lexical matching is named as a technique for creating candidates, using edit distance. I am not sure how this is implemented, but searching an entire large-scale knowledge graph for entities with a small edit distance to the entity at hand sounds rather costly. Are there any heuristics and/or special index structures involved?
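One common way to keep this tractable would be a character n-gram index that pre-filters the labels before computing exact string similarities; a minimal sketch of that idea (not necessarily what the authors do):

    # Sketch: n-gram pre-filtering before string comparison (one possible heuristic,
    # not necessarily the authors' implementation).
    from collections import defaultdict
    from difflib import SequenceMatcher   # stand-in for a proper edit-distance function

    def ngrams(s, n=3):
        return {s[i:i+n] for i in range(len(s) - n + 1)}

    def build_index(labels, n=3):
        index = defaultdict(set)
        for label in labels:
            for g in ngrams(label.lower(), n):
                index[g].add(label)
        return index

    def candidates(query, index, n=3, min_shared=2):
        counts = defaultdict(int)
        for g in ngrams(query.lower(), n):
            for label in index[g]:
                counts[label] += 1
        # only labels sharing enough n-grams are compared exactly
        return sorted((l for l, c in counts.items() if c >= min_shared),
                      key=lambda l: SequenceMatcher(None, query, l).ratio(),
                      reverse=True)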
Section 4.2.2: the authors claim that "entity misuse is often not caused by semantic confusion, but by similarity of spelling and token composition". I would like to see some evidence for that statement. Semantic confusion is not too rare either; metonymy is a typical case (e.g., using the city "Manchester" instead of the soccer team). So I would like to see some statistics on the typical error sources. These could also help motivate the selection of candidate generation and scoring mechanisms, see above.
Section 4.2.3: to the best of my knowledge, the lookup service uses so-called surface forms, i.e., anchor-text links extracted from Wikipedia, and scores candidates by the conditional probability that a search string s, used as a surface form, links to an entity e. Later, in Section 5.2, the authors mention that DBpedia Lookup also uses the abstract of an entity, which I think it does not (but I am not 100% sure either). The authors should double-check the inner workings of DBpedia Lookup and clarify the description of the service.
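If my understanding is correct, the score is essentially the anchor-based conditional probability

    P(e \mid s) = \frac{\mathrm{count}(s \to e)}{\sum_{e'} \mathrm{count}(s \to e')}

where count(s -> e) is the number of Wikipedia anchors with surface text s that link to entity e; this is my reading of the service and should be confirmed by the authors.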
Section 4.2.3: the description of merging the two entity lists seems to assume that both have k results, but result lists in DBpedia Lookup may have different lengths. How are lists of different lengths merged? Moreover, DBpedia Lookup also returns a score (called "RefCount" in the API), so I wonder why the score is discarded in favor of the list position, instead of ordering the merged list by the score.
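To make the question concrete, the two obvious options would look roughly as follows (hypothetical code, assuming each list holds (entity, score) pairs):

    # Sketch: merging two ranked candidate lists of different lengths.
    def merge_by_rank(a, b):
        """Round-robin interleave by list position, ignoring the scores."""
        merged, seen = [], set()
        for i in range(max(len(a), len(b))):
            for lst in (a, b):
                if i < len(lst) and lst[i][0] not in seen:
                    merged.append(lst[i][0])
                    seen.add(lst[i][0])
        return merged

    def merge_by_score(a, b):
        """Order the union of both lists by the service's score (e.g., RefCount)."""
        best = {}
        for entity, score in a + b:
            best[entity] = max(score, best.get(entity, float('-inf')))
        return [e for e, _ in sorted(best.items(), key=lambda x: -x[1])]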
Section 4.3.1: Lines 9-11 in Algorithm 1 could be more simply rephrased as E = E ∪ {o | ⟨s, p, o⟩ ∈ script(E), o is an entity}.
Section 4.3.1: Algorithm 1 seems to extract neighborhoods only from statements with the same predicate as the target assertion. For example, if my target assertion is ⟨s, p, o⟩, the neighborhood graph would not contain any statement of s or o that uses a predicate other than p. Is that really intended?
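To spell out the two readings (pseudocode over a set of triples, not the authors' code):

    # Sketch: neighborhood restricted to the target predicate vs. all predicates.
    def neighborhood_same_predicate(triples, subject, predicate):
        """What Algorithm 1 appears to do: keep only triples with the target predicate."""
        return {(s, p, o) for (s, p, o) in triples if s == subject and p == predicate}

    def neighborhood_all_predicates(triples, subject):
        """Alternative reading: keep all outgoing triples of the subject."""
        return {(s, p, o) for (s, p, o) in triples if s == subject}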
Section 4.3.2: At the point where sampling is done, there is already some relatedness/similarity notions in place. Did the authors also consider weighted sampling using those relatedness/similarity scores as weights?
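For instance, weighted sampling with the existing scores would be straightforward (sketch, assuming candidates and their relatedness scores are at hand):

    # Sketch: sample candidates with probability proportional to their relatedness score.
    import random

    def weighted_sample(candidates, scores, k):
        # random.choices samples with replacement; for sampling without replacement,
        # something like numpy.random.choice(candidates, size=k, replace=False,
        # p=normalized_scores) could be used instead.
        return random.choices(candidates, weights=scores, k=k)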
Section 5.1.1: The authors describe how DBpedia is accessed, but similar statements on how the other two datasets are accessed are missing.
Section 5.1.1: "literals containing multiple entity mentions are removed" -> How exactly? What do you consider a multiple entity mention? For example, would "University of London" be considered a multiple entity mention, since it mentions both "London" and "University of London"?
Section 5.1.1: Likewise, "properties with insufficient literal objects are complemented with more literals from DBpedia" -> How exactly is that done? For both this and the previous item, please provide a more detailed description of what is happening here, plus a discussion of how it eases/complicates the task at hand.
Section 5.2: Could you also combine multiple related entity estimation mechanisms? What would the results be then?
Section 5.3.1: The comparison methods (like AttBiRNN) and their configurations should be explained more thoroughly. Likewise, how exactly is RDF2vec exploited as a baseline?
Overall, as can be seen from this list, there are a lot of open questions about this paper. I am confident that if those are addressed in a revised version, this paper will be a really interesting contribution to be published in SWJ.
Minor points:
p.1: canonicalizaiton -> canonicalization
p.15: MSU-Map -> MUS-Map