Review Comment:
(Reference numbers in this review are those of the paper under review, except for [**] given at the end.)
Overview
========
The submission presents an adaptation of previous work of the authors [13,14] to learning SHACL constraints from knowledge graphs (KGs).
The learning mechanism is based on discovering paths (i.e., sequences of properties) that frequently occur starting from a given node type (e.g., song, human, city).
These paths can then be merged into a single SHACL constraint for the given type.
The paper starts by recalling, in Section 2, the definition of a closed path (CP), which is a dependency of the form (all variables are universally quantified)
P(x,y) <- P_1(x,z_1), ..., P_n(z_{n-1},y)
stating that whenever a path P_1 ... P_n leads from x to y, then P(x,y) is a fact in the KG.
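For instance (an illustrative example of mine, not taken from the paper), a CP could be

    isCitizenOf(x,y) <- livesIn(x,z_1), isLocatedIn(z_1,y)

stating that whenever the path livesIn.isLocatedIn leads from a person x to a country y, then isCitizenOf(x,y) is a fact in the KG.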
The paper then defines an open path (already defined in the tech report [13]) as a dependency of the form
P(x,z_0) <- P_1(z_0, z_1), ..., P_n(z_{n-1}, y)
I do not fully understand this definition, because the quantifications are not explicitly given.
The authors say that this rule contains free variables, but I do not know which variables are free: x? z_0? y?
In Section 3 the authors introduce inverse open path (IOP) as a rule of the form
P(x,z_0) -> \exists (z_1, ..., z_{n-1}, y). P_1(z_0, z_1), ..., P_n(z_{n-1}, y)
which states that whenever P(x,z_0) is a fact in the KG, there exists a path P_1 ... P_n starting at z_0 in the KG.
IOP rules correspond to SHACL constraints when the variables x and z_0 are unified and P is a node type.
For instance,
song(x,x) -> \exists (z_1,y). album(x,z_1), recordLabel(z_1,y) (*)
is an IOP rule and corresponds to a SHACL constraint saying that nodes of type song have an album property whose value is a node that has a recordLabel property.
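For concreteness, here is how I understand rule (*) would be rendered as a SHACL shape with a sequence path (a sketch of mine, with hypothetical ex: IRIs, and assuming the type song is materialized as a class):

    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/> .

    ex:SongShape a sh:NodeShape ;
        sh:targetClass ex:Song ;
        sh:property [
            sh:path ( ex:album ex:recordLabel ) ;  # sequence path album/recordLabel
            sh:minCount 1                          # at least one such value exists
        ] .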
Several rules with a common body, e.g., song(x,x), can be combined into what the authors call a tree, to yield a more complex SHACL constraint.
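If I understand the tree construction correctly, two IOP rules with the common body song(x,x) would be combined into a shape with several property constraints, in the same hypothetical vocabulary as above:

    ex:SongShape a sh:NodeShape ;
        sh:targetClass ex:Song ;
        sh:property [ sh:path ( ex:album ex:recordLabel ) ; sh:minCount 1 ] ;
        sh:property [ sh:path ( ex:performer ex:birthPlace ) ; sh:minCount 1 ] .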
The learning algorithm, called SHACLearner, is intuitively based on discovering correlations between the occurrences of the body of a rule (e.g., song(x,x)) and of its head (e.g., the existence of a path album.recordLabel starting at x).
It is also possible to discover minimum cardinality constraints by counting the number of distinct y's that witness rule (*) above.
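Presumably such a cardinality would surface as an sh:minCount greater than one, e.g.

    sh:property [ sh:path ( ex:album ex:recordLabel ) ; sh:minCount 2 ]

if at least two distinct y's witness rule (*) for every song x.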
The SHACLearner algorithm is given in Section 3, together with a brief explanation from which I was unable to understand several important aspects (see below).
I could, however, see that the algorithm is very close to the ones in [1,2]. The main difference seems to be the construction of the tree (which is clearly a contribution of the paper under review), and possibly the evaluation of the relevance of discovered rules also differs. The authors do state that the algorithm is an adaptation of the OPRL algorithm from [14], but I would have preferred to see explicitly which parts are previous work and which parts are contributions of the paper under review.
Section 4 then discusses related work, about which I have a couple of remarks (see below).
Section 5 presents an experimental evaluation of the algorithm on the YAGO, DBpedia and Wikidata KGs. The algorithm is run on each KG with 50 randomly selected node types (e.g., song in example (*)) and predefined quality thresholds for the rules.
The experiments show that the algorithm very quickly (in less than 3 hours) finds at least one quality IOP rule for 80 to 98 percent of the types.
Other figures show the number of quality tree rules discovered by combining IOPs.
Finally, some conclusions are drawn.
Evaluation
==========
The experiments show that SHACLearner is able to discover rules very quickly in huge knowledge graphs, which is IMHO a very interesting feature.
The paper is well written and generally clear (with some exceptions listed below).
I however have some concerns regarding the usefulness and relevance of the discovered rules, with two aspects that I develop in the sequel.
Completeness of SHACLearner
---------------------------
Does the algorithm discover all quality rules satisfied in the KG? If not, how many quality rules are missing from the final result (i.e., what is the probability of discovering a given quality rule)? What kinds of rules are missed?
SHACLearner prunes the graph and uses heuristics, which IMHO is necessary in order to handle huge KGs. It is however not clear whether the pruning or the heuristics discard relevant rules.
Regarding the pruning, on p. 6 l. 37 it is said that "entities and predicates that are not relevant for the target predicate" are removed. Without an explanation of how relevance is decided, it is impossible to judge whether pruning indeed removes only irrelevant entities and predicates.
Also, SHACLearner discovers frequent paths by exploring all possible paths in the KG, which of course requires a huge amount of computation. A heuristic based on so-called similarity is used in order to reduce the search space. Similarity is not defined, only described intuitively. As far as I can understand, a path P_1(y,z), P_2(z,t) has a higher score (so more chance of being kept by the heuristic) if the set P²_1 of entities that appear in second position of P_1-facts is "similar" to the set P¹_2 of entities that appear in first position of P_2-facts (all these entities correspond to valuations of the variable z). Although I do not know what exactly "similar" means here, I can imagine cases where a path is relevant while the sets P²_1 and P¹_2 have a very small intersection. For instance, P_1 and P_2 could be very frequent properties that appear virtually everywhere (so presumably have very small similarity?), while the path P_1.P_2 exists for ALL entities of some type T. Does the definition of similarity allow the corresponding rule to be discovered?
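To make this concrete, suppose similarity is something like a Jaccard index (my guess; the paper does not say):

    sim(P_1, P_2) = |P²_1 ∩ P¹_2| / |P²_1 ∪ P¹_2|

If P_1 and P_2 each have 10^6 distinct entities in the relevant positions but only 10^3 of them in common, then sim is roughly 10^3 / (2 * 10^6) = 5 * 10^-4, presumably below any reasonable threshold, even though those 10^3 shared entities could include all entities of some small type T.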
Here is a suggestion of how one could assess which rules are discarded by the above-mentioned approximations: take a small KG and apply SHACLearner to it both without and with the approximations. Do you get the same rules?
If yes, then nothing is proved (SHACLearner could still miss important rules on another KG). If no, then you have an indication of which rules the approximations prevent from being discovered.
Is it possible to quantify the precision/completeness of the algorithm? Can you compute the probability that a rule satisfied in the KG is indeed discovered?
Relevance of discovered rules wrt use scenarios
-----------------------------------------------
While IOP rules can indeed be encoded in SHACL, they cover only a fragment of it. They do not allow one to encode complex cardinalities, data values, or Boolean operators (negation and disjunction are hard to learn; I would not expect to find good algorithms for those).
In particular, the zero-or-one cardinality is very important in modelling, but SHACLearner never generates rules with zero-or-one cardinality.
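In SHACL, zero-or-one cardinality is a constraint like

    sh:property [ sh:path ex:deathDate ; sh:maxCount 1 ]

(my own example, with a hypothetical predicate); it cannot be obtained from IOP rules, which only assert the existence of paths.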
How relevant are the discovered shapes? Can you describe use cases or scenarios in which precisely these shapes are relevant?
In the Conclusion the authors do motivate the use of SHACL shapes for knowledge graphs, but are the shapes discovered by SHACLearner relevant for any of these tasks? For instance, are the rules discovered by SHACLearner suitable for generating forms for data completion? In order to complete data, one needs a reliable and approved schema. Can automatically discovered rules be used to decide that some data is missing from the graph?
Also, SHACLearner is based on discovering paths of some length that are then combined. Do paths often occur in constraints in practical applications? IMHO the most frequent case is for the shape to describe the immediate neighborhood of the node, or maybe up to distance 2 or 3. What use scenarios justify such a complex algorithm as SHACLearner, which explores the huge search space of all paths up to some length? When is exploring just the immediate neighborhood not enough for discovering a 'good' shape?
Other remarks
=============
Related work
------------
While I did not notice any important missing references, I find the comparison with some of the related work unfair.
- p. 9 l. 27-34, comparison with [31]: I would not say that [31] is a "partial attempt" to provide logical foundations for the semantics of SHACL. It is, I think, quite complete, and in any case a more complete proposal for the semantics than the rules presented in the paper under review. By the way, it could be interesting to compare the rules you propose with the formalization proposed in [31].
- p. 9, first paragraph of Section 4. The related works [8,22-25] are qualified as being "without logical foundation", and their output formalisms are implicitly qualified as not "well-founded". While I myself like to use logical formalisms in order to define things without ambiguity, I would not say that other kinds of definitions are not well-founded. As the main author of [25], I claim that the output of the shape inference algorithm there is very precisely characterized (see also the companion work [**]): it is a precisely defined generalization, to SHACL rules, of the neighborhoods of a set of randomly chosen sample nodes. No probabilities, no heuristics, but an explicit random sampling with clearly identified use scenarios: help expert users write schemas, and help novice users get some idea of the shape of the data.
Miscellaneous
-------------
- p. 2, right, l. 46: it is said that the rule forms a "loop". The standard definition of a loop that I know of is a path that starts and ends at the same element, which is not the case here. What kind of loop is it?
- Definitions of OP and IOP rules: I found it a bit disturbing to distinguish the variable y from the z variables, since in the definitions of satisfies and support, and of SC and HC, y is also existentially quantified. Later on I could see that y is relevant for the cardinality. Maybe this could be mentioned right away, to clarify why y is treated differently from the z's.
- p. 12, right, l. 25: what you call %SucRate was previously called %SucTarget.
- I'm curious about the matrix representation used to explain the efficiency of the algorithm. With the size of the KGs considered, you have matrices of size n x n with n = 3 000 000. Can such a matrix even fit in RAM (even if one Boolean value is represented by one bit)? Do you use sparse matrices? What kind of matrix multiplication algorithm is used?
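  A back-of-the-envelope computation: a dense Boolean matrix with n = 3 000 000 occupies n² = 9 * 10^12 bits, i.e. more than a terabyte, per predicate, so dense representations are clearly out of the question. A sparse representation instead costs space proportional to the number of facts of the predicate, which would explain feasibility, but the paper should say which representation is used.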
Requested revisions
===================
Explicit contributions
----------------------
- Clearly state what the new contributions of this paper are and what comes from previous work, especially in Section 3 and the SHACLearner algorithm.
Better formalization and explanations
-------------------------------------
- State precisely which external components are used by SHACLearner (e.g., RLvLR, RESCAL) and what their properties are. This includes:
- equation (5): what are the formal properties of the scoring function? What precision does this scoring function have, i.e., what guarantee does the value computed in (5) give on the actual probability of P_0(e_1, e_2)?
- what is similarity (mentioned on p. 6 l. 48)?
- Explain what the sampling is, or at least what its formal properties are, so that one can judge whether sampling might remove relevant parts of the KG.
Motivation
----------
- Qualify and/or quantify the output of the SHACLearner algorithm. What kinds of paths might not be discovered? What is the probability that an actual, frequently occurring path is discovered by the algorithm?
- Identify use scenarios: which use cases can be supported by SHACLearner, in particular in comparison with simpler algorithms?
Additional citation
===================
[**] Iovka Boneva, Jérémie Dusart, Daniel Fernández Álvarez, Jose Emilio Labra Gayo. Semi Automatic Construction of ShEx and SHACL Schemas. https://arxiv.org/abs/1907.10603