GQA_{RDF}: An Efficient SPARQL Query Answering Engine on RDF Graphs

Tracking #: 2922-4136

Authors: 
Chu Huang
Qianzhen Zhang1
Deke Guo
Xiang Zhao
Xi Wang

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Full Paper
Abstract: 
Due to the increasing use of RDF data, efficient processing of SPARQL queries over RDF datasets has become an important issue. In graph-based RDF data management solution, SPARQL queries are translated into subgraph patterns and evaluated over RDF graphs via graph matching. However, answering SPARQL queries requires handing RDF reasoning to model implicit triples in RDF data, which is largely overlooked by existing graph-based solutions. In this paper, we investigate to equip graph-based solution with the important RDF reasoning feature for supporting SPARQL query answering. In detail, we first propose an on-demand saturation strategy, which only selects an RDF fragment that may be potentially affected by the query. Then, we provide a filtering-and-verification framework to efficiently compute the answers of a given query. The framework groups the equivalent entity vertices in the RDF graph to form semantic abstracted graph as index, and further computes the matches according to the multi-grade pruning supported by the index. In order to drive and expedite query evaluation, we conceive an effective cost model for estimating the step-wise cost of query pattern matching. In addition, we show that the semantic abstracted graph and the graph saturation can be efficiently updated upon the changes to the data graph, enabling the framework to cope with dynamic RDF graphs. The results of extensive experiments on real-life and synthetic datasets demonstrate the effectiveness and efficiency of our algorithms.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 12/Nov/2021
Suggestion:
Reject
Review Comment:

This paper presents a technique for answering SPARQL queries under RDF entailment. The technique uses precomputed indexes that identify candidate entities that can map to the variables in the query, after which a filtering technique is used to compute the query answers.

This paper suffers from numerous problems, all of which make the paper unsuitable for publication in SWJ. Moreover, I do not believe that additional revisions can help improve this paper. In particular, I will next present in detail many technical issues, and I believe that to address them would require changing the approach and the experimental setting so much that the result would be a completely different paper. Therefore, I believe that the appropriate course of action is to reject this paper.

The following list summarises my points of criticism at a high level.

* The presented algorithms overlook many important details and, in fact, do not seem to be able to handle all SPARQL BGP queries. Please note that I am not complaining about not handling all of SPARQL; rather, the approach does not seem to be able to handle triple patters of certain forms (such as those containing variables in class or property position, patterns with multiple properties between pairs of variables, or patterns without rdf:type edges). In my detailed comments I give specific examples of places that seem to imply this.

* The paper has been poorly written and is unclear in many places. Formal definitions are used inappropriately. Many details of the presented approach are left unclear.

* The experimental setting seems to be fundamentally flawed. The authors compare the performance of their system with systems that were not designed to support reasoning with RDFS; to overcome this problem, the authors use a reformulation approach when running queries in competitor systems. As I discuss in detail below, this does not facilitate a fair comparison.

I will next present my detailed comments about the paper. I will introduce them in the order in which I found them in the paper (rather than in the order of importance).

--------------------------------------------------------------------------------

- page 1, line 19: "solution" -> "solutions"

- page 1, line 21: "handing" -> "handling"

- page 1, line 22: "feature" -> "features"

- page 1, line 22: "feature" -> "features"

- page 1: change last sentence to "one may use the following SPARQL query:"

- The query on page 2 is syntactically incorrect: should not have angle brackets around it.

- The first paragraph on page 2 does not make much sense to me. What is a relational solution to RDF data management? Do the authors mean “a solution where RDF data is stored in a relational table”? If that is what is meant, then it would be good to be explicit about it. Also, graph-based solutions are not inherently different from relational solutions: all databases I am aware of use various join operators and index. For example, the RDF-3X is not really all that different conceptually from a standard database: it uses hash and merge joins just like a standard database would do, and it uses B*-tree indexes; however, these are adapted to the specifics of RDF. So while there are some differences in how these functions are implemented in generic relational systems and systems that specifically target RDF data, there are sufficient similarities between all of these approaches so that the separation into "relational" and "graph-based" is rather meaningless.

- The second paragraph on page 2 also does not make much sense to me. In what sense is SPARQL easier? In fact, easier than what? I need to point out that it is well known that logic-based perspective on query languages (which is what SPARQL is: it uses variables to formulate a logical expression) is equivalent in terms of expressivity to relational algebra. Moreover, "graph pattern matching" is nothing else than computing joins in a relational sense: every BGP can be translated into a select-project-rename-join query of relational algebra. The computational complexity of “graph pattern matching” and SPRJ-algebra is identical, and the methods to solve these two problems are largely the same. To me, this just shows how the distinction between "relational" and "graph-based" approaches to RDF data management is really arbitrary. The main difference is that RDF systems instantiate a generic database architecture in a way that is more amenable to RDF, that is it!

- In paragraph 2 on page 2, the authors go on to claim that graph-based system have superior performance because of "graph pattern matching". This is not the case: the better performance is due to how various components of a DB architecture are implemented (e.g., indexes and so on).

- page 2, line 28: what is the difference between "query evaluation" and "query answering"?

- page 3, line 33: "experiment results" -> "experimental results"

- page 3, line 45: What is the difference between an entity and a class? In RDF, there is no such difference: RDF just talks about resources and knows nothing about classes.

- The formalisation of an RDF graph in Definition 1 is incorrect in that it overlooks that RDF graphs can contain several edges between the same pair of resources. Specifically, note that E_G is defined to be a *set*, which means that it cannot contain any duplicates; moreover, function L_G is defined to map each edge in E_G to a single property (in fact, the authors say "The label of an edge is its corresponding property"). Therefore, this definition cannot capture an RDF graph that contains triples and : set E_G would contain one edge , but what would function L_G map this edge to? If we define L_G() = P1, then we cannot represent the second triple; and if we define L_G() = P2, then we cannot represent the first triple.

- The mention of "is a valuable feature" in Definition 2 is bad style: definitions should be used to make mathematical statements and not imply judgment about the value of a feature. In fact, as such Definition 2 does not say much -- that is, it does not introduce any formal notions. Therefore, this text should not be a definition.

- Definition 3 is too imprecise to be a definition. To use a (formal) definition, one would actually spell out the entailment process with all its formal details.

- Definition 4 suffers from the same problem as Definition 2: it cannot represent queries containing several triples that mention the same pair of variables but with different labels. In addition, Definition 4 is unclear about the range of L_Q: is the range supposed to include just resources or can it include variables? For example, to represent BGP <?s, ?p, ?o>, function L_Q should map the edge <?s,?o> to the variable ?p.

- Definition 5 seems to answer my question from the previous paragraph: variables on edges are not allowed -- the authors consider only the case where an edge in a query is mapped to a property. This is a significant drawback in terms of expressivity of this approach, but the authors never really consider this drawback explicitly. In my opinion, significant restrictions such as this one should be prominently stated in the introduction (and possibly even the abstract) in order to not misrepresent the contribution.

- Definition 6 is strange. First, a sentence such as "It is important to keep in mind the distinction between query evaluation and query answering" does not belong in a formal definition: the purpose of definitions is to precisely introduce mathematical concepts needed to present the work. Moreover, the distinction between "query evaluation" and "query answering" goes against the common practice in database literature, where the two terms are generally considered synonyms. What authors (erroneously) call "query answering" is usually called "query answering under constraints" (or, equivalently, "query evaluation under constraints"). Changing the meaning of well-known terms from a research area is not good practice, and it just makes understanding the work harder.

- page 5, line 28: "that is adjacent" -> "that are adjacent"

- The paragraph "Offline index construction" on page 5 is quite hard to follow. At this point, the structure of STP is not known, so I cannot understand what is meant with "we merge the entity vertices in the RDF graph that is adjacent to equivalent class vertices".

- On page 6, the authors introduce the STP index. From this description, I could not understand an important point: do these indexes just contain the relationships that are explicitly present in the data, or do they contain also implicit relationships? If it is the former, note that virtually all RDF systems maintain various kinds of indexes that can easily facilitate lookups such as described on page 6. For example, the RDF-3X system maintains six indexes, one for each permutation of s, p, and o. Thus, the answer to SubPro(p) can be obtained by querying the pos index with the word rdfs:subProperty,p. This is, of course, assuming that only the input data should be indexed.

- In contrast, if STP is supposed to contain the implicit relationships, then I believe the authors have overlooked an important fact: relationships in RDF graph are often cyclic, and the algorithm that is (rather informally) described on page 6 does not seem to capture these correctly. For example, consider an RDF graph that contains the following triples:


Neither of these classes can be designated to be a "superclass vertex": each one of them has a superclass. Moreover, it is unclear how the proposed approach handles this situation because is it not precisely specified by what is meant by "to count the times of T that is extracted" -- extracted from where? If "extracted" means that one pass over the input is made, then this description is too complicated: one could simply say "We make a single pass over the input and, for each vertex T_s, we compute num(T_s) as the number of triples of the form ". In the above example, note that num(A) = num(B) = num(C) = 1, no none of them is a superclass vertex. Finally, note that num(T) can be zero, in which case w(T) is undefined because it involves division by zero.

- page 6, line 52: I could not understand the sentence "Specially, if u has no type, we can use STP to derive corresponding type of u." What does "u has no type" mean exactly -- has no type where? Also, if u has no type, how can we derive its type?

- There is another point about the definition of the concept graph on page 6 that I find confusing. As far as I could understand, each node in the concept graph correspond to one class -- that is, L_c(U_c) is just one class. Moreover, the set U_c is supposed to be a partition of the entities from the input graph. This, however, means that a concept graph cannot handle cases where one entity belongs to two unrelated classes. For example, consider the following RDF graphL

Note that classes A and B are unrelated, so they are "superclass vertices". Thus, the concept graph would contain two nodes, U_A and U_B, each representing entities belonging to classes A and B, respectively. But then, where are we going to put x? The authors say that U_A and U_B must partition the set U of all entities of the input graph. Thus, we can add x to either U_A or U_B, but not both. However, if we add x just to to U_A or to U_B, we are losing information about class membership of x.

In addition, there is yet another problem with the definition of concept graph. The authors say that the nodes of a concept graph must partition the entities of the input graph, which I take to mean that each entity in the input graph must occur in exactly one node of the concept graph. (That is the usual meaning of "partition".) Now let us extend the above set of triples with another triple:

The concept graph still contains just U_A and U_B, but where are we going to put y and z? These entities do not have a class, so putting them into either U_A or U_B would be incorrect. Maybe this can be corrected by requiring U_c to be a partition of a *subset* of entities of the input graph -- I do not know. All this, however, just shows to me that the formulation of the approach is extremely messy and not of adequate quality for a journal such as SWJ: a top-quality publication should not contain such ambiguities.

- Terminology in Definition 8 is confusing. The authors say "A semantic abstracted graph is multi-grade concept graph", which then they explain to mean that a "semantic abstract graph" consists of a family of graphs. But then, the name "semantic abstract graph" is misleading: it suggests that there is just one graph to consider. It would be better to call this something like "semantic index".

- page 6, line 28: "trip" -> "triple"

- I found Definition 8 very confusing and hard to follow. What is a child-class? This term was never introduced properly. Moreover, the notation used here is very hard to follow. Nevertheless, after some thinking and looking at Figure 5, I think I get the idea: the idea is to start with the top-level classes in the hierarchy and split the entities for each subclass level. Note that this cannot work correctly with cyclic class hierarchies such as what I outlined above.

- Function SplitMerge is used in Algorithm 1, but is not introduced anywhere in the paper. Thus, I can only guess as to what it does. On page 8 the authors argue about the complexity of this function, but this point is moot since I do not know what the function does.

- I think I understand the idea behind multi-grade filtering on page 8: the idea is to assign to each variable in the query a set of candidate entities so that the edges in the query (irrespective of their labels) can be embedded in a concept graph. This algorithm, however, will be incomplete for several reasons.

First, what if a query does not type each variable? For example, the following SPARQL query does not contain rdf:type edges, so I do not understand what are going to be the candidate for the variables ?X and ?Y:

SELECT ?X ?Y WHERE { ?X :R ?Y }

Second, SPARQL queries can contain variables in class positions. For example, the following is a valid SPARQL query:

SELECT ?X ?Y WHERE { ?X rdf:type ?Y }

Since ?Y is a variable, I do not see how the algorithm from page 8 can process this query. I would like to stress that both examples of queries I give above are used in practice very frequently. I guess an approach that can handle a subset of SPARQL efficiently could still be of scientific interest, but then it is paramount to precisely describe the fragment of SPARQL that is being handled here.

- The approach presented in this paper seems to be closely related to the approach presented in the following paper:

Sairam Gurajada, Stephan Seufert, Iris Miliaraki, Martin Theobald: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. SIGMOD Conference 2014: 289-300

In this paper, the authors also create a kind of summary graph, which is used to identify candidate entities that can potentially match variables in the query. The authors should include a detailed comparison of their approach with this well-known approach. The paper I mention above is used in a distributed setting, but summarisation technique presented there can easily be adapted to a centralised setting too.

- The description of the saturation step on page 9 does not contain sufficient technical detail. To improve clarity, the authors should include a pseudo-code of their approach.

- page 9, line 42: "it's" -> "its"

- I could not understand how exactly the neighbourhood encoding works: the description on page 9 is not sufficiently precise and it uses confusing notation. To what exactly are the hash functions of a Bloom filter applied? I get a sense that they are applied to property name, but I am not sure and the paper is not sufficiently precise. Moreover, I could not understand whether properties in a graph have bit strings; if so, I could not understand how they are generated.

- On page 10, equation (7), I could not understand what "where Mi represents the set of matching results for the subgraph of Q that is induced by (v1, v2, ..., vi)" is. In particular, the authors do not specify what "subgraph of Q included by (v1, v2, ..., vi)" is, and they do not state clearly what "matching results" are.

- In fact, I could not understand most of Section 6: there is just not enough detail. Again, it would be good to present the matching algorithm in pseudo-code that would hopefully contain all the details about the cost model.

- To illustrate just one point of confusion, consider the first sentence of Example 4: "Consider the query pattern in Figure 7, and suppose that the vertices ?x and ?z have been matched." What does "have been matched" really mean? Does this mean "fixed to specific entities" or "we have determined the set of candidates"?

- Section 7 is equally not easy to follow.

- I have significant doubts about the experimental setup. In particular, in Section 8.2, it seems to me that the authors compare the number of answers to queries produced by their system with that of gStore, but this comparison does not make much sense to me. Unless I am mistaken, gStore never claims to perform any reasoning, so it is not surprising that GQA_RDF produces more answers. In that sense, the comparison is completely unfair or, at he very least, completely uninformative. Note that query answering under RDFS semantics is a well-defined problem that has a precisely defined answer on each input graph. Thus, any system that claims to do reasoning is *obliged* to produce the exact set of relevant answers for each query, or it cannot be taken seriously. In other words, comparing the number of answers between different systems is irrelevant: any system that does not give all answers is simply buggy and should be discounted. In other words, I do not believe that systems giving more answers are somehow "more useful": a system that claims to support RDFS reasoning *must* produce the exact set of answers, no more and no less.

- In Section 8.3, the authors compare the performance of their approach with that of gStore and Turbo_Hom++, and I have significant doubts about that too. Since gStore and Turbo_Hom++ do not natively support RDFS reasoning, the authors reformulated their queries to take RDFS reasoning into account. This, however, raises many important questions.

In particular, it is unclear how the reformulation has been performed. It is well-known that reformulation can significantly increase the size of a query, and this can significantly affect the performance of query answering engines. Moreover, for most queries the reformulation is not unique -- that is, there are many different ways to reformulate a query into another query that takes reasoning into account. The authors never specify which reformulation approach they used (there is no citation in the paper). However, it is well known that some reformulation approaches may produce queries that are easier for RDF engines to process than others. For example, an obvious way to reformulate a query is to replace each triple pattern of the form with a disjunction { } UNION { } UNION ... UNION { }, where C1, ..., Cn are all subclasses of C. If a query contains many triple patterns, the resulting query is a conjunction over a number of disjunctions, and such queries are typically very hard for standard RDF query engines because estimating the correct join order becomes difficult. However, such a query can be transformed into disjunctive normal form (i.e., into a union of BGPs), and answering that will generally be *much* easier for most RDF query engines. Now the transformation into disjucntive normal form can blow up the query size exponentially in the size of the RDFS schema, but there are numerous approaches to further reduce the size of the rewriting. In the research literature on OWL 2 QL, such techniques have been extensively investigated, and they could easily be adapted to the RDFS setting.

To summarise, I consider the comparison with gStore and Turbo_Hom++ seriously unfair. First, these systems were likely given queries for which they have not been specifically tuned. Second, the paper does not even produce sufficient detail to know what kind of queries were given to these systems.

The comparison would be fairer if the authors compared their work with a system that directly implements RDFS query answering. As far as I know, OpenLink Virtuoso is one such (commercial) system: it implements some form of SPARQL query answering over RDF graphs while taking into account reasoning. Being a commercial system, the creators do not specify what algorithms they use, but I can imagine their algorithms have been highly tuned. It would be interesting to see whether GQA_RDF can beat such a system.

I accept that comparing with commercial-grade system might be difficult. However, other measures should then be made to ensure fairness of comparison. A simple way to do this would be to switch reasoning off and compare the "base" performance of GQA_RDF to gStore and Turbo_Hom++. The candidate selection procedure described in the paper should still be effective. Moreover, such a comparison would be fair because all systems would be applied to problems they were specifically designed for.

Another way to facilitate a fairer comparison would be to feed the reformulated query into GQA_RDF and switch RDFS reasoning off. Then, comparing the performance of GQA_RDF with gStore and Turbo_Hom++ would make sense since all systems would be applied to problems they were specifically designed for. Once this baseline comparison has been established, one could then switch RDFS reasoning in GQA_RDF on and show how this improves performance.

The authors contain many other types of tests in the paper that aim to estimate the contributions of different parts of the algorithm. As far as I can tell, however, all of these comparisons are unfair for exactly the same reason as I outline above, and therefore I question their usefulness.

Review #2
By Marvin Hofer submitted on 21/Dec/2021
Suggestion:
Major Revision
Review Comment:

In the paper, the authors propose a framework that focuses on RDF query answering for graph-based RDF data management.
The overall language is sound and comprehensible.
The paper is structured well and follows a transparent methodology.

The authors give a fitting introduction to graphs, RDF data management, and RDF entailment.
The work is motivated by closing the research gap in existing graph-based RDF data management solutions w.r.t to query answering (RDF query evaluation and RDF entailment).
The work (i.e. framework) provides the following set of contributions:
an on-demand saturation strategy for RDF entailment based on an internal data structure (called STP?)
a semantic abstract graph/concept graph as index structure
cost-driven pattern matching supported by a step-wise cost model for query evaluation
an incremental maintenance strategy on the semantic abstract graph index that can handle edge insertions and edge deletions, as well as updating the saturation data structure without recomputing it from scratch
an experimental evaluation benchmark in comparison to related work

Overall, the authors give extensive definitions of their proposed methodology and terminology. Moreover, they sustain these definitions with suitable examples.
However, some open questions remain (below) concerning the presented methodology, and the evaluation and related work sections lack some necessary aspects or detailed explanations.

# Methodology

For instance, in definition 7, by following the class hierarchy in some ontologies, every class is often subClassOf owl:Thing (https://dbpedia.org/ontology/Species).
How is this tackled or avoided? his should be explained.
At p7l40, the empirical study could be better elaborated. Again concerning other ontologies, this can result in too broad concepts (using only a three-grade concept graph).

Another open question is regarding the initial data used for saturation.
Are only explicitly provided RDFS statements considered for saturation and building of the concept graph?
Many new ontologies derive/reference classes and properties from existing well-known ontologies, generally accessible via Linked-Data.
Not resolving these external resources used in the existing RDFS statements could reduce this approach's effectiveness.
As a solution in reformulation, federated queries can be used. Alternatively, the resolution of external resources is made once in the saturation process.

# Evaluation

The framework is compared to related systems using benchmark queries on real-world and synthetic data in the experimental evaluation.
The section is four pages long, with two pages of figures "leaving" only two pages of explanation/insights.

First of all, no code or repository is referenced. So it is not possible to reproduce the results.
In the first paragraph, the authors state, "three algorithms were implemented", but then "four" are enumerated. This is unclear.
From the reader's perspective, it is unclear if code from the other systems was reused or if the authors implemented the related RDF stores themself. Without having reference to the implementations, their correctness is not verifiable.
Providing code for reproducibility will improve the work's scientific sufficiency (but as I noticed, this is also a problem of related work)

The used YAGO version is unclear (YAGO4?).
Both datasets, YAGO or LUBM, are not cited.
Any particular reason to choose the queries form [22] or [23]?)

The four evaluation aspects are well selected but lack some detail.
Why only provide values for the property count in YAGO and LUBM?
It can be interesting how many classes are contained in the datasets concerning RDF entailment. Maybe the class hierarchy can directly affect the results.
In 8.3, only reformulation was applied because saturation takes too much effort or was not doable?

They also claim their system "outperforms state of the art RDF data management systems" (p3l33).
What is with non-graph-based solutions?

# Related Work

The authors argued that there are no other graph-based solutions for query answering (p2l28).
But the related work section is incomplete.
Only in-memory RDF stores are discussed.
There are also other databases such as Virtuoso, Alegrograph, and Stardog (see https://www.w3.org/2014/data-shapes/wiki/Entailment_Support)
For instance, virtuoso has a feature that enables entailment based on reformulation (automatically) internally.

What is with approaches of query answering concerning non RDF data but sharing possible similarities?

---

I suggest a major revision since the work addresses an interesting research gap, but w.r.t the current issues, the quality is insufficient for a journal paper. After incorporating the mentioned aspects, the paper has the potential to be accepted.

Please find below some more remarks and suggestions for improvement.

---

Regarding the reproducibility aspect, I searched for available implementations of other systems.
Here is a list of repositories I found:

RDF-3X https://gitlab.db.in.tum.de/dbtools/rdf3x
Hexstore http://crubier.github.io/Hexastore/
3store https://sourceforge.net/projects/threestore/
TurboHom++ ???
GRIN ???
gStore https://github.com/pkumod/gStore
Grass ???
AMbER ???
OWLIM https://www.ontotext.com/products/graphdb/

However, other projects did not publish their implementations to reproduce their results as well…

---

Comments:

p3 there is a footnote for BGP but not for RDFS and RDF entailment?

p5l26 STP abbreviation was never introduced?

p7l40 empirical study? Could you elaborate this based on used RDF datasets or ontologies (w.r.t class hierarchies)

p7l43 reference to Algo 1 missing (irritating to read because it is on the next page)

p8l46 maybe move the parentheses part to the end of the sentence

p9l1 first sentence looks unfinished (or I could not understand it)

p9l4 <(>v,v’> instead of

p13l25 says 3 algorithms were implemented but 4 are enumerated (maybe was meant to exclude the own algorithm?)

p13l28 citation for CFL [20] is missing but was done again for TurboHom++ and gStore

p13l41 cite LUBM “LUBM: A Benchmark for OWL Knowledge Base Systems ” and Yago again (“YAGO4” or original Yago paper?)

p15l16 Q_6b instead of Q^l_6

p16l25 G is the size of G instead of |G| is the size of G

p16 Fig 12(1) is labeled with IncTree_RDF instead of GQA_RDF?