smart-KG: Partition-Based Linked Data Fragments for Querying Knowledge Graphs

Tracking #: 3518-4732

Amr Azzam
Axel Polleres
Javier D. Fernandez
Maribel Acosta

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
RDF and SPARQL provide a uniform way to publish and query billions of triples in open knowledge graphs (KGs) on the Web. Yet, provisioning a fast, reliable, and responsive live querying solution for open KGs is still hardly possible through SPARQL endpoints alone: while such endpoints provide remarkable performance for single queries, they typically cannot cope with highly concurrent query workloads from multiple clients. To mitigate this, the Linked Data Fragments (LDF) framework sparked the design of different alternative low-cost interfaces, such as Triple Pattern Fragments (TPF), that partially offload the query processing workload to the client side. On the downside, such interfaces still come at the expense of unnecessarily high network load due to the necessary transfer of intermediate results to the client, leading to query performance degradation compared with endpoints. To address this problem, in the present work, we investigate alternative interfaces, refining and extending the original TPF idea, which also aim at reducing server-resource consumption by shipping query-relevant partitions of KGs from the server to the client. To this end, we first align the formal definitions and notations of the original LDF framework to uniformly present existing LDF implementations and such “partition-based” LDF approaches. These novel LDF interfaces retrieve, instead of the exact triples matching a particular query pattern, a subset of pre-materialized, compressed partitions of the original graph, containing all answers to a query pattern, to be further evaluated on the client side. As a concrete representative of partition-based LDF, we present smart-KG+, extending and refining our prior work [1] in several respects. 
Our proposed approach is a step towards a better-balanced share of the query processing load between clients and servers: it ships graph partitions driven by the structure of RDF graphs, grouping entities described with the same sets of properties and classes, which results in a significant reduction of data transfer. Our experiments demonstrate that smart-KG+ significantly outperforms existing Web SPARQL interfaces on both pre-existing benchmarks for highly concurrent query execution and a custom query workload inspired by the query logs of existing SPARQL endpoints.
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 31/Aug/2023
Major Revision
Review Comment:

The paper has been improved in this version and the authors have made an effort to go through all the comments from the reviewers and address them as well as they could.

Despite the improvements, the paper still seems to require additional work. There is a very large number of unclear statements in the text. It is also unclear whether the experimental setup was fair (e.g., was SaGe allowed to use significantly fewer resources than the other systems?), whether the experiments can be reproduced, and whether the code of smart-KG+ is included in the provided repository.

While a public repository is available, it lacks sufficient documentation to help potential users. For instance, no instructions are provided in the repository to run each of the different systems that were used to obtain the results presented in Section 5.

I could not find the code that implements the algorithms presented in the paper.
The files in the repository look very similar to the files in these repositories: and
Section 5 presents results for different systems (TPF+NP, brTPF+NP, smart-KG+, Wise-KG^{Typed-Family}), but it is unclear how I could obtain results for each of these systems using the code available in the repository. The code in the repository seems to be that of Wise-KG.

Page 2, footnote 1
There seems to be a typo in "In act"

Page 3, lines 48-49
The text "We refer to the components (i.e. subjects, predicates, and objects) of a single RDF triple t \in G as subj(G), pred(G), obj(G), where the RDF graph G is a finite set of such triple" seems wrong and to mix the two notions introduced in the following paragraph: subj(G)/pred(G)/obj(G) and subj(t)/pred(t)/obj(t)

Page 4, line 6
The text "in the context of the later introduced frameworks" lacks references that make clear which frameworks are meant there

Page 5, line 38
The text "whenever the set of bindings Ω is not considered (i.e. only empty binding sets Ω = \emptyset are expected)" AND definitions 2.2 and 2.3 seem to indicate that when \Omega is an empty set, so should be s(G, Q, \Omega) and s*(G, Q, \Omega). It seems that there must be a mistake in the definitions or in the text that follows them

Table 1, Data Dump
\Gamma_1 should be \Gamma_0, given the values of l and o and the text on page 6, line 1

Table 1, SPF
The text "(iterating over increasing offsets o := o + l)" seems to indicate an algorithmic solution instead of a formal definition

Table 1, SPARQL Endpoints
\Phi lacks its parameters, but assuming they are as for other interfaces, it seems unclear why the first parameter "n" would be 1. Maybe that is the case in some possible correspondences, but it is not the case in the one I could expect: as many triples as triple patterns are in Q

Table 1, SAGE
It is unclear why \Gamma_0 is not included
It is also unclear why the first parameter of \Phi is 1 (similar to SPARQL Endpoints)

Page 7, line 47
It is unclear why \Gamma_0 is not included (it was mentioned in line 1, page 6)

Page 8, lines 10-12
It is unclear if there is any relation between 'n' and 'm'

Page 8, line 27
typo or missing word in the text "on and LDF-server"

Page 9, line 30
It is unclear what is meant by "p = pred(Q) ∩ pred(G)": the right side is a set, whereas the left side is a predicate

Page 10, lines 2-3
"we assume partitions per n object ranges (e.g. from a histogram) can be split into a set of ordered values {v0, ... , vn}" it is unclear if there are n or n+1 ranges

Page 10, line 44
The text related to "the modulo operator" likely comes from an early version and it doesn't seem to be used in line 46

Page 11, line 23
the text "we see various options here" could be supported by adding a few details and references

Page 13, Figure 2
Maybe using F_1, F_2, F_3 on the right side is a mistake (those are defined on the left side)
There is an inconsistency: the predicates should be IRIs as in Figure 1 (e.g., :director instead of director)

Page 13, lines 20-21
The text is inconsistent with the predicates and classes used in the example given just before
The statements in lines 46-46 of page 2 and lines 24-25 of page 13 are slightly inconsistent, was it 88% or 90%?

Page 19, lines 6-7
The definition of Q seems to have been written twice in the same sentence

Page 19, line 30
The text "If the result is trivially empty (lines 4-5), it returns an empty plan." is unclear.

There are some inconsistencies in Algorithm 2 and the text that describes it, in the algorithm it is stated that TPF is used to evaluate some subplans, but the text mentions brTPF. It might be due to the use of "TPF" to denote brTPF requests

Control has been used as: "a control parameter to transfer intermediate bindings together with query patterns, or a control to specify the page size..." (page 5), but later it is used as "TPF control" (page 19, line 50) and "brTPF LDF API control TPF(Q_s, \Omega)" (page 20, line 49). It is unclear what is meant by control and whether the same thing is meant in these three places. On page 19, it seemed to denote the keywords "TPF" or "SKG". The use of TPF and \Omega and the connection to the brTPF selector seems misleading (in Table 1 it was stated that the main difference between the selectors of TPF and brTPF is that \Omega is empty for TPF).

Page 21, lines 1-2
The 's' seems to disappear and Q_s becomes Q. It is unclear if it means the entire query, possibly with more than one star

Page 22, Algorithm 3
It is unclear if the use of solution mappings from previous stars is restricted to the last star. For instance, for a plan with four stars that uses brTPF for the first 3 stars and SKG for the last one, it seems that the algorithm would not use the solution mappings when evaluating the 2nd and 3rd stars labelled with brTPF, even if those are available from the evaluation of previous stars using brTPF.

Page 23
The text: "We implement server-side query planner to re-order the star-subqueries and triple patterns based cardinality estimations available at the server." is likely missing one or more words.
In [31], it was shown that SaGe performs better when 4 workers are available (compared to having 1 worker), but it is unclear if 4 workers are a good choice for the setup in this paper; given the server specifications, 32 workers could have been a better choice for SaGe

Page 24
It seems odd that the 3 sub-workloads watdiv-stf have such different sizes. It is unclear how these sizes and queries were chosen. Going from 156 queries to 6, 14, or 23 seems like a very big change. Workloads with so few queries do not seem to be stress testing. A possible cause could be that the predicate rdf:type doesn't occur so often in the original workload

typo in watdiv-stfboth

Pages 25-26
The value of \tau_{class}_{high} for DBpedia is inconsistent between the text and Table 5 (the text states 0.01/100 as the value and the table states 0.1/100)
While it is good that the DBpedia Analysis information is provided in the repository, it lacks some description to help others understand what was done
It is unclear how the three different configurations used in the Ablation Study can be obtained / tested using the code in the repository

Tables 6 and 7
None of the columns is labeled as smart-KG+. It is unclear if all the systems used to obtain those results are still missing an important component of smart-KG+

Page 26, lines 43-44
The text "This is attributed to the query execution strategy of smart-KG+, which always pushes intermediate results to brTPF, instead of joining the intermediate results entirely at the client-side." significantly differs from what is stated in Algorithm 3 (where intermediate results seem to be ignored for all but the last star)

Maybe the client could be a bit smarter and only use solution mappings when beneficial. That would likely also help with what is mentioned on page 35: "while brTPF has a poor performance since some of the triple patterns are non-selective (even though the entire star-query is highly selective)"

Section 5.7 omits results for watDiv1B

Section 5.7.3 also omits results for watDiv100M

smart-KG+ seems to have three main improvements with respect to existing systems: i) typed-partitions; ii) server-side planning; iii) use of brTPF instead of TPF. But Section 5 seems to mainly focus on ii and iii, or i, and it is unclear if all three improvements have been considered for some of the experiments. For example, Section 5.3 considers two versions (one with ii and one with ii and iii / but it is unclear if they had i), Section 5.4 does not consider i, Section 5.7 only considers i. It is unclear if Section 5.5 uses i.

The last sentence in the caption of Table 11 seems redundant

Page 36, line 21
Maybe there is a typo in "watdiv-btf_{unbounded}" and it should be "watdiv-stf_{unbounded}". In any case, this text does not seem to be consistent with Table 11

Numbers in Table 11 do not seem to support the following lesson learned about WiseKG: "The results show that typed-family partitioning significantly reduces data transfer". When using typed-family partitioning reduces the data transfer, it seems to remain in the same order of magnitude. But in the case when it is increased, it seems to be increased by an order of magnitude. Maybe if larger datasets had been considered it would be clearer how much using typed-family partitions improves or worsens the performance of WiseKG.

I disagree with the reply to comment 49: showing that a property holds for an instance of size 2 is not equivalent to having done a proof by induction.

Review #2
By Stasinos Konstantopoulos submitted on 11/Sep/2023
Review Comment:

The submission presents a method for SPARQL query processing that joins the trend for light-weight servers and explores a new point on the trade-off between overloading the servers and transporting too much data to the client. This is an interesting method and a reasonable path to explore in the TPF/SaGe line of research.

Regarding the potential impact of the work presented, the specific objection I expressed in the first review (i.e., sensitivity to parameterization) has been given appropriate care and the relevant section (Sect 4.2.2) has been improved with concrete details about the rationale for selecting parameters. In any case, my reluctance about the practical applicability of the system as it currently stands should not stand in the way of publishing the ideas and experimental results, as these can be a useful stepping stone towards a more complete solution.

The text has improved significantly with respect to the first draft.
Some minor comments:

Sect 3.2, definition of \cal{G}:

My comment has been addressed, at least at the formal level, by specifying an ordering over canonicalized graphs. However, I still do not understand why ordering is required, when it seems to me that any bucketing scheme will do. Am I missing something?

Section 5, p. 13, line 38, "disjoint partitions":

With the clarification you added, the text is clear. I still do not understand why "subsets" or "parts" won't do, but it's your text.

Review #3
By Vojtěch Svátek submitted on 12/Sep/2023
Review Comment:

I went through the author response and both the revised and original version, and see all my comments addressed (aside a few very marginal ones). I thus assume that the paper can be accepted as is.