Review Comment:
The authors have addressed most of my comments in a very good and adequate way. The revision is substantially deeper and easier to understand, and some problematic claims have been toned down appropriately. I like this work a lot, but I have one more remark that I would like to see addressed before publication.
My main suggestion for a minor revision is: the authors acknowledge the existence of noise-tolerant type prediction approaches (such as SDType), but dismiss them as baselines because their own approach is more versatile. While I can buy that argument from the way the approach is constructed (it is, in fact, capable of predicting arbitrary relations, while type prediction methods are not), it is not clear whether it actually does so in practice. To see that validated, I would like to ask the authors to provide some base statistics about the ground-truth inference graphs: which fraction of them consists of type statements, and which of relation assertions? Which fraction of inference graphs contains relation assertions at all? The same should be done for the results of the proposed approach, i.e., report not only aggregate per-triple results, but also recall/precision on type and non-type relations separately.
The reason behind this request comes from Listing 2 in the appendix. There, 11 out of 12 assertions are type assertions. Hence, a strong type prediction system could achieve 91.7% recall and 100% precision on the inference task, and would thus actually be very competitive. In fact, it would even be possible (although I do not really believe so) that the authors have mainly learned a strong type predictor, and that the missing recall is only due to other relations.
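To make the arithmetic behind these numbers explicit (assuming standard micro-averaged per-triple metrics): a system that predicts exactly the 11 type assertions of that inference graph would reach precision = 11/11 = 1.0 and recall = 11/12 ≈ 0.917.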
Given my request above for some base statistics for the benchmark datasets: if the fraction of type statements in the inference graphs is not significantly lower than the reported per-triple recall of the proposed approach, there is actually no justification for ruling out noise-tolerant type prediction approaches as a baseline.
Do not get me wrong: I do not strongly believe that the proposed approach is empirically outperformed by type prediction, but the current presentation of the results simply does not demonstrate that convincingly.
My remaining points of critique are lighter-weight:
* I would kindly ask the authors to provide access to their code and data (even though the construction of the latter is described rather clearly).
* I would like to see a discussion of whether it plays a role that reasoning with RDFS rules is monotonic (especially since the extension to OWL reasoning is mentioned in the outlook).
* The hierarchy in Fig. 1 could be understood as implying that being propagable or not is a property inherent to the noise, whereas it actually depends on the reasoning used (in this case: which RDFS chaining rules are applied). This should be clarified.
* In Fig. 7, why do non-propagable errors (RAT, GCC, WOAD) lead to worse results than propagable errors?
* Can you explain the drop in Fig. 10 around epoch 65? It would be interesting to look into the model before the drop, the model at the drop (which has an accuracy of 0.3), and the best model after the drop: what exactly are the errors made by the bad model, and does it converge back to the same model or a different one?
Some small issues:
* I think on p. 1, line 41, the authors refer to information/relation extraction rather than named entity linking.
* As the authors quite frequently refer to individual RDFS chain rules, these should be listed in a table to make the paper self-contained (the appendix already has tables with special properties of those rules, but not the rules as such); two example rules are spelled out after this list.
* The label of the y axis in Fig. 6 is obscured
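For clarity, by RDFS chain rules I mean the standard RDFS entailment rules such as rdfs9, (?x rdf:type ?c1), (?c1 rdfs:subClassOf ?c2) ⇒ (?x rdf:type ?c2), and rdfs7, (?p rdfs:subPropertyOf ?q), (?x ?p ?y) ⇒ (?x ?q ?y).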