Deep learning for noise-tolerant RDFS reasoning

Tracking #: 2136-3349

Bassem Makni
James Hendler

Responsible editor: Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract:
Since the 2001 envisioning of the Semantic Web (SW) [1] as an extension to the World Wide Web, the main research focus in SW reasoning has been on the soundness and completeness of reasoners. While these reasoners assume the veracity of the input data, the reality is that the Web of data is inherently noisy. Although there has been recent work on noise-tolerant reasoning, it has focused on type inference rather than full RDFS reasoning. Even though RDFS closure generation can be seen as a Knowledge Graph (KG) completion problem (link prediction in particular), the problem setting is different—making KG embedding techniques that were designed for link prediction not suitable for RDFS reasoning. This paper documents a novel approach that applies advances in deep learning to extend noise-tolerance in the SW to full RDFS reasoning; this is a stepping stone towards bridging the Neural-Symbolic gap for RDFS reasoning and beyond. Our embedding technique—which is tailored for RDFS reasoning—consists of layering RDF graphs and encoding them in the form of 3D adjacency matrices, where each layer layout forms a graph word. Each input graph and its entailments are then represented as sequences of graph words, and RDFS inference can be formulated as translation of these graph-word sequences, achieved through neural machine translation. Our evaluation confirms that deep learning can in fact be used to learn the RDFS inference rules from both synthetic and real-world SW data while demonstrating a noise tolerance unavailable with rule-based reasoners; learning the inference on the LUBM synthetic dataset achieved 98.4% validation and 98% test accuracy, while it achieved 87.76% validation accuracy on a subset of DBpedia.
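The abstract's layered encoding can be illustrated with a minimal sketch (hypothetical names and toy data, not the authors' actual implementation): an RDF graph becomes a 3D adjacency tensor with one layer per predicate, and each layer's layout corresponds to one "graph word" in the paper's terminology.

```python
import numpy as np

def encode_graph(triples, entities, predicates):
    """Encode RDF triples as a 3D adjacency tensor.

    Layer k holds the adjacency matrix of predicate k:
    tensor[k, i, j] == 1 iff (entities[i], predicates[k], entities[j])
    is asserted in the graph.
    """
    e_idx = {e: i for i, e in enumerate(entities)}
    p_idx = {p: k for k, p in enumerate(predicates)}
    tensor = np.zeros((len(predicates), len(entities), len(entities)),
                      dtype=np.uint8)
    for s, p, o in triples:
        tensor[p_idx[p], e_idx[s], e_idx[o]] = 1
    return tensor

# Toy graph: Alice is a Student, and Student is a subclass of Person.
triples = [("ex:Alice", "rdf:type", "ex:Student"),
           ("ex:Student", "rdfs:subClassOf", "ex:Person")]
entities = ["ex:Alice", "ex:Student", "ex:Person"]
predicates = ["rdf:type", "rdfs:subClassOf"]
t = encode_graph(triples, entities, predicates)
# Each layer t[k] is one predicate's adjacency matrix; a sequence of such
# layers is what the seq2seq model translates into the entailed graph
# (here, the RDFS closure would add ("ex:Alice", "rdf:type", "ex:Person")).
```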

Minor Revision

Solicited Reviews:
Review #1
By Heiko Paulheim submitted on 08/Mar/2019
Minor Revision
Review Comment:

The authors have addressed most of my comments in a very good and adequate way. The revision is substantially deeper and better understandable, and some problematic claims have been toned down appropriately. I do like that work a lot, but I have one more remark that I would like to see addressed before publication.

My main suggestion for a minor revision is: the authors acknowledge the existence of noise-tolerant type prediction approaches (such as SDType), but discard them as baselines as their approach is more versatile. While I can buy that argument by the way the approach is constructed (it is, in fact, capable of predicting arbitrary relations, while type prediction methods are not), it is not clear whether it actually does that in practice. To see that validated, I would like to ask the authors to provide some base statistics about the ground truth inference graphs: which fraction of them is type statements, and which fraction is relation assertions? Which fraction of inference graphs contains relation assertions at all? The same should be done for the results of the proposed approach, i.e., report not only aggregate results per triple, but also recall/precision on types and on non-type relations separately.
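The base statistics and per-category scores the reviewer requests can be computed directly from the triple sets. A minimal sketch (hypothetical helper names, triples represented as (s, p, o) tuples):

```python
RDF_TYPE = "rdf:type"

def type_fraction(triples):
    """Fraction of triples that are rdf:type assertions."""
    if not triples:
        return 0.0
    return sum(1 for _, p, _ in triples if p == RDF_TYPE) / len(triples)

def split_by_category(triples):
    """Split a triple set into type assertions and other relation assertions."""
    types = {t for t in triples if t[1] == RDF_TYPE}
    return types, triples - types

def precision_recall(predicted, gold):
    """Per-triple precision and recall of a predicted set against a gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy inference graphs to illustrate the requested breakdown.
gold = {("ex:a", RDF_TYPE, "ex:C"),
        ("ex:a", "ex:knows", "ex:b"),
        ("ex:b", RDF_TYPE, "ex:C")}
pred = {("ex:a", RDF_TYPE, "ex:C"),
        ("ex:b", RDF_TYPE, "ex:C")}

g_types, g_rels = split_by_category(gold)
p_types, p_rels = split_by_category(pred)
# Report type_fraction(gold) plus precision_recall on the type and
# non-type partitions separately, rather than one aggregate number.
```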

The reason behind this request comes from listing 2 in the appendix. Here, 11 out of 12 assertions are type assertions. Hence, a strong type prediction system could achieve 91.7% recall and 100% precision on the inference task, and would thus actually be very competitive. In fact, it would be possible (although I do not really think so) that the authors mainly learned a strong type predictor, and the missing percentages in recall are only due to other relations.

Given my request above for some base statistics for the benchmark datasets: in case the fraction of type statements in the inference graphs is not significantly lower than the reported per-triple-recall of the proposed approach, there is actually no justification for ruling out noise-tolerant type prediction approaches as a baseline.

Do not get me wrong: I do not strongly believe that the proposed approach is empirically outperformed by type prediction, but the current presentation of the results just does not prove that seamlessly.

My remaining points of critique are more light-weight:
* I would kindly ask the authors to provide access to their code and data (despite the construction for the latter being described rather clearly).
* I would like to see a discussion of whether it plays a role that reasoning with RDFS rules is monotonic (esp. when the extension to OWL reasoning is mentioned in the outlook)
* The hierarchy in Fig. 1 could be understood as suggesting that being propagable or not is a property inherent to the noise, but it actually depends on the reasoning used (in this case: which RDFS chaining rules are used). This should be clarified
* In Fig. 7, why do non-propagable errors (RAT, GCC, WOAD) lead to worse results than propagable errors?
* Can you explain the drop in Fig. 10 around epoch 65? It would be interesting to look into the model before the drop, the model at the drop (which has an accuracy of 0.3), and the best model after the drop: what exactly are the errors made by the bad model? and does it converge back to the same model or a different one?

Some small issues:
* I think on p. 1, 41, the authors rather refer to information/relation extraction than named entity linking
* As the authors quite frequently refer to individual RDFS chain rules, these should be listed in a table to make the paper self-contained (the appendix already has tables with special properties of those rules, but not the rules as such)
* The label of the y axis in Fig. 6 is obscured

Review #2
By Dagmar Gromann submitted on 11/Mar/2019
Minor Revision
Review Comment:

I would like to thank the authors for making the manuscript very readable and for their very interesting work. However, there are still several small issues in the description of the approach. For the neural network and the KG embedding training phases, there are no architecture details (number of layers, units per layer, activation function, etc.), no framework mentioned (pytorch, tensorflow, etc., or did you use any existing model such as Fairseq?), and no hyperparameters (learning rate, embedding dimensionalities, vocabulary size, etc.) provided. This, together with the fact that neither code nor data are published, makes it very difficult if not impossible to reproduce the proposed approach. Thus, I encourage the authors to include all these very important details (it is even mentioned that the LUBM1 model has different hyperparameters, but they are not provided) and publish their data and code (and include a link to both in the paper, even if the code is not perfectly polished - this can be updated in the following months). Also, the training results are not discussed - how come the validation accuracy suddenly drops radically around epoch 70 for the scientist dataset? In a nutshell, while the remainder of the paper is well readable and in good shape, sections 6 and 7 require substantially more details and analysis.

Please also ensure that your manuscript adheres entirely to the style guidelines. For instance, the abstract may not have more than 200 words, references to figures in the text are Figure 3 and not Fig. 3, Listing 1 is capitalized (also in the Appendix), remove line numbers for final version, etc.

There are also several inconsistencies in the paper:
- non propagable vs. non-propagable => should always be the latter
- super class vs. super-class => super-class
- use of dashes => "embedding-" vs. "embedding -" and "--" vs. "-" => use one type of dash throughout (-- or -), with hyphenation distinguished from dashes, i.e., blank space between a dash and the surrounding words
- references should be [number] and not ([number]) -> for literal quotes, please include the page numbers

Minor remarks:
- Section 2.2. line 45 - ref to Appendix A?
- Section 3 line 33 DBPedia[7] => white space
- Section 3.1.1 "taxonomy in draw in" ???
- Section 5.1. line 10 full stop missing
- Appendix: Proposition 1 in Appendix G, Corollary 1 in Appendix G - otherwise not clear where