Review Comment:
This paper presents a technique for answering SPARQL queries under RDF entailment. The technique uses precomputed indexes that identify candidate entities that can map to the variables in the query, after which a filtering technique is used to compute the query answers.
This paper suffers from numerous problems, all of which make the paper unsuitable for publication in SWJ. Moreover, I do not believe that additional revisions can help improve this paper. In particular, I will next present in detail many technical issues, and I believe that to address them would require changing the approach and the experimental setting so much that the result would be a completely different paper. Therefore, I believe that the appropriate course of action is to reject this paper.
The following list summarises my points of criticism at a high level.
* The presented algorithms overlook many important details and, in fact, do not seem to be able to handle all SPARQL BGP queries. Please note that I am not complaining about not handling all of SPARQL; rather, the approach does not seem to be able to handle triple patterns of certain forms (such as those containing variables in class or property position, patterns with multiple properties between pairs of variables, or patterns without rdf:type edges). In my detailed comments I give specific examples of places that seem to imply this.
* The paper has been poorly written and is unclear in many places. Formal definitions are used inappropriately. Many details of the presented approach are left unclear.
* The experimental setting seems to be fundamentally flawed. The authors compare the performance of their system with systems that were not designed to support reasoning with RDFS; to overcome this problem, the authors use a reformulation approach when running queries in competitor systems. As I discuss in detail below, this does not facilitate a fair comparison.
I will next present my detailed comments about the paper. I will introduce them in the order in which I found them in the paper (rather than in the order of importance).
--------------------------------------------------------------------------------
- page 1, line 19: "solution" -> "solutions"
- page 1, line 21: "handing" -> "handling"
- page 1, line 22: "feature" -> "features"
- page 1: change last sentence to "one may use the following SPARQL query:"
- The query on page 2 is syntactically incorrect: it should not have angle brackets around it.
- The first paragraph on page 2 does not make much sense to me. What is a relational solution to RDF data management? Do the authors mean “a solution where RDF data is stored in a relational table”? If that is what is meant, then it would be good to be explicit about it. Also, graph-based solutions are not inherently different from relational solutions: all databases I am aware of use various join operators and indexes. For example, RDF-3X is not really all that different conceptually from a standard database: it uses hash and merge joins just like a standard database would, and it uses B*-tree indexes; however, these are adapted to the specifics of RDF. So while there are some differences in how these functions are implemented in generic relational systems and systems that specifically target RDF data, there are sufficient similarities between all of these approaches that the separation into "relational" and "graph-based" is rather meaningless.
- The second paragraph on page 2 also does not make much sense to me. In what sense is SPARQL easier? In fact, easier than what? I need to point out that it is well known that logic-based perspective on query languages (which is what SPARQL is: it uses variables to formulate a logical expression) is equivalent in terms of expressivity to relational algebra. Moreover, "graph pattern matching" is nothing else than computing joins in a relational sense: every BGP can be translated into a select-project-rename-join query of relational algebra. The computational complexity of “graph pattern matching” and SPRJ-algebra is identical, and the methods to solve these two problems are largely the same. To me, this just shows how the distinction between "relational" and "graph-based" approaches to RDF data management is really arbitrary. The main difference is that RDF systems instantiate a generic database architecture in a way that is more amenable to RDF, that is it!
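To make this equivalence concrete, here is a minimal sketch (with an invented triple table and invented resource names) showing that evaluating a two-pattern BGP is nothing more than a selection on the triple relation followed by a natural join:

```python
# A triple "relation": the entire RDF graph as one table (invented data).
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
]

def scan(pattern):
    """Selection: yield variable bindings for rows matching a triple pattern."""
    for row in triples:
        binding = {}
        ok = True
        for term, val in zip(pattern, row):
            if term.startswith("?"):
                if binding.get(term, val) != val:
                    ok = False
                    break
                binding[term] = val
            elif term != val:
                ok = False
                break
        if ok:
            yield binding

def join(left, right):
    """Natural join of two binding sets on their shared variables."""
    right = list(right)
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                yield {**l, **r}

# BGP { ?x knows ?y . ?y knows ?z } is a self-join on the triple table.
answers = list(join(scan(("?x", "knows", "?y")), scan(("?y", "knows", "?z"))))
assert answers == [{"?x": "alice", "?y": "bob", "?z": "carol"}]
```

This is exactly the select-project-rename-join translation mentioned above; "graph pattern matching" computes the same result by the same joins.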
- In paragraph 2 on page 2, the authors go on to claim that graph-based systems have superior performance because of "graph pattern matching". This is not the case: the better performance is due to how various components of a DB architecture are implemented (e.g., indexes and so on).
- page 2, line 28: what is the difference between "query evaluation" and "query answering"?
- page 3, line 33: "experiment results" -> "experimental results"
- page 3, line 45: What is the difference between an entity and a class? In RDF, there is no such difference: RDF just talks about resources and knows nothing about classes.
- The formalisation of an RDF graph in Definition 1 is incorrect in that it overlooks that RDF graphs can contain several edges between the same pair of resources. Specifically, note that E_G is defined to be a *set*, which means that it cannot contain any duplicates; moreover, function L_G is defined to map each edge in E_G to a single property (in fact, the authors say "The label of an edge is its corresponding property"). Therefore, this definition cannot capture an RDF graph that contains triples <A, P1, B> and <A, P2, B>: set E_G would contain one edge <A, B>, but what would function L_G map this edge to? If we define L_G(<A, B>) = P1, then we cannot represent the second triple; and if we define L_G(<A, B>) = P2, then we cannot represent the first triple.
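The following sketch makes the problem concrete (resource names A and B, and properties P1 and P2, are illustrative): a single-valued label function necessarily loses one of two parallel triples, whereas mapping each edge to a *set* of labels does not.

```python
# Two RDF triples between the same pair of resources (illustrative names).
triples = [("A", "P1", "B"), ("A", "P2", "B")]

# Definition 1's encoding: a set of edges plus a single-valued label function.
edges = {(s, o) for s, p, o in triples}
labels = {}
for s, p, o in triples:
    labels[(s, o)] = p  # the second assignment overwrites the first label

# Only one of the two properties survives, so one triple is lost.
recovered = [(s, labels[(s, o)], o) for s, o in edges]
assert len(recovered) == 1

# A faithful encoding maps each edge to the *set* of its labels.
multilabels = {}
for s, p, o in triples:
    multilabels.setdefault((s, o), set()).add(p)
assert multilabels[("A", "B")] == {"P1", "P2"}
```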
- The mention of "is a valuable feature" in Definition 2 is bad style: definitions should be used to make mathematical statements and not imply judgment about the value of a feature. In fact, as such Definition 2 does not say much -- that is, it does not introduce any formal notions. Therefore, this text should not be a definition.
- Definition 3 is too imprecise to be a definition. To use a (formal) definition, one would actually spell out the entailment process with all its formal details.
- Definition 4 suffers from the same problem as Definition 1: it cannot represent queries containing several triples that mention the same pair of variables but with different labels. In addition, Definition 4 is unclear about the range of L_Q: is the range supposed to include just resources, or can it include variables? For example, to represent the BGP <?s, ?p, ?o>, function L_Q should map the edge <?s, ?o> to the variable ?p.
- Definition 5 seems to answer my question from the previous paragraph: variables on edges are not allowed -- the authors consider only the case where an edge in a query is mapped to a property. This is a significant drawback in terms of expressivity of this approach, but the authors never really consider this drawback explicitly. In my opinion, significant restrictions such as this one should be prominently stated in the introduction (and possibly even the abstract) in order to not misrepresent the contribution.
- Definition 6 is strange. First, a sentence such as "It is important to keep in mind the distinction between query evaluation and query answering" does not belong in a formal definition: the purpose of definitions is to precisely introduce mathematical concepts needed to present the work. Moreover, the distinction between "query evaluation" and "query answering" goes against the common practice in database literature, where the two terms are generally considered synonyms. What authors (erroneously) call "query answering" is usually called "query answering under constraints" (or, equivalently, "query evaluation under constraints"). Changing the meaning of well-known terms from a research area is not good practice, and it just makes understanding the work harder.
- page 5, line 28: "that is adjacent" -> "that are adjacent"
- The paragraph "Offline index construction" on page 5 is quite hard to follow. At this point, the structure of STP is not known, so I cannot understand what is meant by "we merge the entity vertices in the RDF graph that is adjacent to equivalent class vertices".
- On page 6, the authors introduce the STP index. From this description, I could not understand an important point: do these indexes just contain the relationships that are explicitly present in the data, or do they contain also implicit relationships? If it is the former, note that virtually all RDF systems maintain various kinds of indexes that can easily facilitate lookups such as described on page 6. For example, the RDF-3X system maintains six indexes, one for each permutation of s, p, and o. Thus, the answer to SubPro(p) can be obtained by querying the pos index with the word rdfs:subProperty,p. This is, of course, assuming that only the input data should be indexed.
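For illustration, here is a minimal sketch of such a lookup (property names are invented, and the dictionary is a simplified stand-in for RDF-3X's pos permutation index):

```python
# Toy data: explicit rdfs:subPropertyOf triples (invented property names).
triples = [
    ("hasFather", "rdfs:subPropertyOf", "hasParent"),
    ("hasMother", "rdfs:subPropertyOf", "hasParent"),
    ("hasParent", "rdfs:subPropertyOf", "hasAncestor"),
]

# A pos-style index: (predicate, object) -> set of subjects.
pos = {}
for s, p, o in triples:
    pos.setdefault((p, o), set()).add(s)

def sub_pro(prop):
    """Direct subproperties of prop, obtained by a single pos lookup."""
    return pos.get(("rdfs:subPropertyOf", prop), set())

assert sub_pro("hasParent") == {"hasFather", "hasMother"}
```

If only explicitly stated relationships are indexed, this is all that STP appears to offer, and such lookups come essentially for free in existing systems.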
- In contrast, if STP is supposed to contain the implicit relationships, then I believe the authors have overlooked an important fact: relationships in RDF graphs are often cyclic, and the algorithm that is (rather informally) described on page 6 does not seem to capture these correctly. For example, consider an RDF graph that contains the following triples:

<A, rdfs:subClassOf, B>
<B, rdfs:subClassOf, C>
<C, rdfs:subClassOf, A>

None of these classes can be designated to be a "superclass vertex": each one of them has a superclass. Moreover, it is unclear how the proposed approach handles this situation because it is not precisely specified what is meant by "to count the times of T that is extracted" -- extracted from where? If "extracted" means that one pass over the input is made, then this description is too complicated: one could simply say "We make a single pass over the input and, for each vertex T_s, we compute num(T_s) as the number of triples of the form <T, rdfs:subClassOf, T_s>". In the above example, note that num(A) = num(B) = num(C) = 1, so none of them is a superclass vertex. Finally, note that num(T) can be zero, in which case w(T) is undefined because it involves division by zero.
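The single-pass counting I describe can be sketched as follows (class names are illustrative; the cyclic hierarchy A ⊑ B ⊑ C ⊑ A is my reading of the problematic case):

```python
# A cyclic class hierarchy: every class has a superclass (illustrative names).
triples = [
    ("A", "rdfs:subClassOf", "B"),
    ("B", "rdfs:subClassOf", "C"),
    ("C", "rdfs:subClassOf", "A"),
]

# One pass over the input: num(T_s) counts triples <T, rdfs:subClassOf, T_s>.
num = {}
for s, p, o in triples:
    if p == "rdfs:subClassOf":
        num[o] = num.get(o, 0) + 1

# Every class also appears as the subject of a subClassOf triple, i.e. each
# has a superclass, so no class qualifies as a top-level "superclass vertex".
has_superclass = {s for s, p, o in triples if p == "rdfs:subClassOf"}
top_level = set(num) - has_superclass
assert num == {"A": 1, "B": 1, "C": 1}
assert top_level == set()
```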
- page 6, line 52: I could not understand the sentence "Specially, if u has no type, we can use STP to derive corresponding type of u." What does "u has no type" mean exactly -- has no type where? Also, if u has no type, how can we derive its type?
- There is another point about the definition of the concept graph on page 6 that I find confusing. As far as I could understand, each node in the concept graph corresponds to one class -- that is, L_c(U_c) is just one class. Moreover, the sets U_c are supposed to partition the entities of the input graph. This, however, means that a concept graph cannot handle cases where one entity belongs to two unrelated classes. For example, consider the following RDF graph:

<x, rdf:type, A>
<x, rdf:type, B>

Note that classes A and B are unrelated, so they are both "superclass vertices". Thus, the concept graph would contain two nodes, U_A and U_B, each representing entities belonging to classes A and B, respectively. But then, where are we going to put x? The authors say that U_A and U_B must partition the set U of all entities of the input graph. Thus, we can add x to either U_A or U_B, but not both. However, if we add x just to U_A or just to U_B, we lose information about the class membership of x.
In addition, there is yet another problem with the definition of the concept graph. The authors say that the nodes of a concept graph must partition the entities of the input graph, which I take to mean that each entity in the input graph must occur in exactly one node of the concept graph. (That is the usual meaning of "partition".) Now let us extend the above set of triples with another triple:

<y, P, z> (for some property P)

The concept graph still contains just U_A and U_B, but where are we going to put y and z? These entities do not have a class, so putting them into either U_A or U_B would be incorrect. Maybe this can be corrected by requiring the U_c sets to partition a *subset* of the entities of the input graph -- I do not know. All this, however, just shows that the formulation of the approach is extremely messy and not of adequate quality for a journal such as SWJ: a top-quality publication should not contain such ambiguities.
- Terminology in Definition 8 is confusing. The authors say "A semantic abstracted graph is multi-grade concept graph", which they then explain to mean that a "semantic abstract graph" consists of a family of graphs. But then, the name "semantic abstract graph" is misleading: it suggests that there is just one graph to consider. It would be better to call this something like a "semantic index".
- page 6, line 28: "trip" -> "triple"
- I found Definition 8 very confusing and hard to follow. What is a child-class? This term was never introduced properly. Moreover, the notation used here is very hard to follow. Nevertheless, after some thinking and looking at Figure 5, I think I get the idea: start with the top-level classes in the hierarchy and split the entities at each subclass level. Note that this cannot work correctly with cyclic class hierarchies such as the one I outlined above.
- Function SplitMerge is used in Algorithm 1, but is not introduced anywhere in the paper. Thus, I can only guess as to what it does. On page 8 the authors argue about the complexity of this function, but this point is moot since I do not know what the function does.
- I think I understand the idea behind multi-grade filtering on page 8: the idea is to assign to each variable in the query a set of candidate entities so that the edges in the query (irrespective of their labels) can be embedded in a concept graph. This algorithm, however, will be incomplete for several reasons.
First, what if a query does not type each variable? For example, the following SPARQL query does not contain rdf:type edges, so I do not understand what the candidates for the variables ?X and ?Y are going to be:
SELECT ?X ?Y WHERE { ?X :R ?Y }
Second, SPARQL queries can contain variables in class positions. For example, the following is a valid SPARQL query:
SELECT ?X ?Y WHERE { ?X rdf:type ?Y }
Since ?Y is a variable, I do not see how the algorithm from page 8 can process this query. I would like to stress that both examples of queries I give above are used in practice very frequently. I guess an approach that can handle a subset of SPARQL efficiently could still be of scientific interest, but then it is paramount to precisely describe the fragment of SPARQL that is being handled here.
- The approach presented in this paper seems to be closely related to the approach presented in the following paper:
Sairam Gurajada, Stephan Seufert, Iris Miliaraki, Martin Theobald: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. SIGMOD Conference 2014: 289-300
In that paper, the authors also create a kind of summary graph, which is used to identify candidate entities that can potentially match variables in the query. The authors should include a detailed comparison of their approach with this well-known approach. TriAD targets a distributed setting, but the summarisation technique presented there can easily be adapted to a centralised setting too.
- The description of the saturation step on page 9 does not contain sufficient technical detail. To improve clarity, the authors should include a pseudo-code of their approach.
- page 9, line 42: "it's" -> "its"
- I could not understand how exactly the neighbourhood encoding works: the description on page 9 is not sufficiently precise and it uses confusing notation. To what exactly are the hash functions of a Bloom filter applied? I get a sense that they are applied to property name, but I am not sure and the paper is not sufficiently precise. Moreover, I could not understand whether properties in a graph have bit strings; if so, I could not understand how they are generated.
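For the record, here is my best guess at what the encoding might look like. Everything in this sketch (hashing the property names, the filter width, the number of hash functions) is an assumption on my part, which is precisely the level of detail the paper should pin down:

```python
import hashlib

K_HASHES = 3   # number of hash functions (assumption)
BITS = 32      # width of the bit string (assumption)

def bit_position(name, seed):
    # Derive one bit position from a property name and a hash seed.
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % BITS

def encode_neighbourhood(properties):
    """Bloom-filter bit string over the properties adjacent to a vertex."""
    bits = 0
    for prop in properties:
        for seed in range(K_HASHES):
            bits |= 1 << bit_position(prop, seed)
    return bits

def may_contain(bits, prop):
    # No false negatives; false positives are possible.
    return all(bits & (1 << bit_position(prop, seed)) for seed in range(K_HASHES))

vertex_bits = encode_neighbourhood(["knows", "worksFor"])  # invented names
assert may_contain(vertex_bits, "knows") and may_contain(vertex_bits, "worksFor")
```

If this guess is correct, the authors should say so explicitly and specify the filter parameters; if it is not, the actual construction needs to be spelled out.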
- On page 10, equation (7), I could not understand the phrase "where Mi represents the set of matching results for the subgraph of Q that is induced by (v1, v2, ..., vi)". In particular, the authors do not specify what the subgraph of Q induced by (v1, v2, ..., vi) is, and they do not state clearly what "matching results" are.
- In fact, I could not understand most of Section 6: there is just not enough detail. Again, it would be good to present the matching algorithm in pseudo-code that would hopefully contain all the details about the cost model.
- To illustrate just one point of confusion, consider the first sentence of Example 4: "Consider the query pattern in Figure 7, and suppose that the vertices ?x and ?z have been matched." What does "have been matched" really mean? Does this mean "fixed to specific entities" or "we have determined the set of candidates"?
- Section 7 is equally not easy to follow.
- I have significant doubts about the experimental setup. In particular, in Section 8.2, it seems to me that the authors compare the number of answers to queries produced by their system with that of gStore, but this comparison does not make much sense to me. Unless I am mistaken, gStore never claims to perform any reasoning, so it is not surprising that GQA_RDF produces more answers. In that sense, the comparison is completely unfair or, at the very least, completely uninformative. Note that query answering under RDFS semantics is a well-defined problem that has a precisely defined answer on each input graph. Thus, any system that claims to do reasoning is *obliged* to produce the exact set of relevant answers for each query, or it cannot be taken seriously. In other words, comparing the number of answers between different systems is irrelevant: any system that does not give all answers is simply buggy and should be discounted. Likewise, I do not believe that systems giving more answers are somehow "more useful": a system that claims to support RDFS reasoning *must* produce the exact set of answers, no more and no less.
- In Section 8.3, the authors compare the performance of their approach with that of gStore and Turbo_Hom++, and I have significant doubts about that too. Since gStore and Turbo_Hom++ do not natively support RDFS reasoning, the authors reformulated their queries to take RDFS reasoning into account. This, however, raises many important questions.
In particular, it is unclear how the reformulation has been performed. It is well known that reformulation can significantly increase the size of a query, and this can significantly affect the performance of query answering engines. Moreover, for most queries the reformulation is not unique -- that is, there are many different ways to reformulate a query into another query that takes reasoning into account. The authors never specify which reformulation approach they used (there is no citation in the paper). However, it is well known that some reformulation approaches may produce queries that are easier for RDF engines to process than others. For example, an obvious way to reformulate a query is to replace each triple pattern of the form <?x, rdf:type, C> with a disjunction { <?x, rdf:type, C> } UNION { <?x, rdf:type, C1> } UNION ... UNION { <?x, rdf:type, Cn> }, where C1, ..., Cn are all subclasses of C. If a query contains many triple patterns, the resulting query is a conjunction over a number of disjunctions, and such queries are typically very hard for standard RDF query engines because estimating the correct join order becomes difficult. However, such a query can be transformed into disjunctive normal form (i.e., into a union of BGPs), and answering that will generally be *much* easier for most RDF query engines. Now the transformation into disjunctive normal form can blow up the query size exponentially in the number of triple patterns, but there are numerous approaches to further reduce the size of the rewriting. In the research literature on OWL 2 QL, such techniques have been extensively investigated, and they could easily be adapted to the RDFS setting.
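To illustrate the blow-up, here is a sketch of such a DNF rewriting over an invented toy schema (all class names and the hierarchy are illustrative only):

```python
from itertools import product

# Toy RDFS schema: each class maps to its direct subclasses (invented names).
subclasses = {"C": ["C1", "C2", "C3"], "D": ["D1", "D2"]}

def expand(cls):
    # A class together with all of its subclasses (assumes an acyclic hierarchy).
    result = [cls]
    for sub in subclasses.get(cls, []):
        result.extend(expand(sub))
    return result

def dnf_rewriting(type_patterns):
    """Rewrite a BGP of rdf:type patterns into a union of BGPs (DNF).

    Each pattern (?v, C) becomes a choice among (?v, C') for C' in expand(C);
    the union is the Cartesian product of these choices, so its size is the
    product of the expansion sizes -- exponential in the number of patterns.
    """
    choices = [[(var, c) for c in expand(cls)] for var, cls in type_patterns]
    return list(product(*choices))

# Two patterns: ?x a C . ?y a D  ->  4 * 3 = 12 disjuncts.
union = dnf_rewriting([("?x", "C"), ("?y", "D")])
assert len(union) == 12
```

Which of these forms (the conjunction of UNIONs, the full DNF, or an optimised rewriting) was fed to gStore and Turbo_Hom++ can make an enormous performance difference, which is why the paper must specify it.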
To summarise, I consider the comparison with gStore and Turbo_Hom++ seriously unfair. First, these systems were likely given queries for which they have not been specifically tuned. Second, the paper does not even provide sufficient detail to know what kind of queries were given to these systems.
The comparison would be fairer if the authors compared their work with a system that directly implements RDFS query answering. As far as I know, OpenLink Virtuoso is one such (commercial) system: it implements some form of SPARQL query answering over RDF graphs while taking reasoning into account. Since it is a commercial system, its creators do not specify what algorithms they use, but I can imagine their algorithms have been highly tuned. It would be interesting to see whether GQA_RDF can beat such a system.
I accept that comparing with a commercial-grade system might be difficult. However, other measures should then be taken to ensure fairness of the comparison. A simple way to do this would be to switch reasoning off and compare the "base" performance of GQA_RDF to gStore and Turbo_Hom++. The candidate selection procedure described in the paper should still be effective. Moreover, such a comparison would be fair because all systems would be applied to problems they were specifically designed for.
Another way to facilitate a fairer comparison would be to feed the reformulated queries into GQA_RDF and switch RDFS reasoning off. Then, comparing the performance of GQA_RDF with gStore and Turbo_Hom++ would make sense, since all systems would be evaluating the same queries. Once this baseline comparison has been established, one could switch RDFS reasoning in GQA_RDF on and show how this improves performance.
The paper contains many other tests that aim to estimate the contributions of different parts of the algorithm. As far as I can tell, however, all of these comparisons are unfair for exactly the same reasons I outline above, and I therefore question their usefulness.