smart-KG: Partition-Based Linked Data Fragments for Querying Knowledge Graphs

Tracking #: 3232-4446

Amr Azzam
Axel Polleres
Javier D. Fernandez
Maribel Acosta

Responsible editor: 
Cogan Shimizu

Submission type: 
Full Paper
RDF and SPARQL provide a uniform way to publish and query billions of triples in open knowledge graphs (KGs) on the Web. Yet, provisioning a fast, reliable, and responsive live querying solution for open KGs is still hardly possible through SPARQL endpoints alone: while such endpoints provide remarkable performance for single queries, they typically cannot cope with highly concurrent query workloads from multiple clients. To mitigate this, the Linked Data Fragments (LDF) framework sparked the design of alternative low-cost interfaces such as Triple Pattern Fragments (TPF) that partially offload the query processing workload to the client side. On the downside, such interfaces come at the expense of higher network load due to the necessary transfer of intermediate results to the client, also leading to query performance degradation compared with endpoints. To address this problem, in this work, we investigate alternative interfaces able to ship partitions of KGs from the server to the client, aiming to reduce server-resource consumption. To this end, we first align formal definitions and notations of the original LDF framework to uniformly present partition-based LDF approaches. These novel LDF interfaces retrieve, instead of the exact triples matching a particular query pattern, a subset of materialized, compressed graph partitions to be further evaluated on the client side. Then, we present smart-KG, a concrete partition-based LDF approach. Our proposed approach is a step towards a better-balanced share of query processing load between clients and servers: it ships graph partitions driven by the structure of RDF graphs, grouping entities described with the same sets of properties and classes, resulting in significant data transfer reduction. 
Our experiments demonstrate that smart-KG significantly outperforms existing Web SPARQL interfaces, both on pre-existing benchmarks for highly concurrent query execution and on a novel query workload benchmark we introduce, inspired by query logs of existing SPARQL endpoints.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Sep/2022
Major Revision
Review Comment:

The paper presents an extension of smart-KG, initially proposed in a conference paper, as a novel approach to share query processing load between clients and servers.
smart-KG generates plans that combine shipping of precomputed and compressed partitions (to the client) and evaluation of single triple patterns by the server. The client uses the partitions to process star-shaped subqueries and evaluates joins (and other operators) to produce the query result.
The extension presented in this paper comprises:
- server side planning that exploits cardinality estimations to produce a better plan
- precomputation of typed-family partitions
- use of the brTPF interface to reduce the data transfer and number of requests when evaluating single triple patterns (TPF was used before)

It would have been great if the individual contribution of each of these three changes had been evaluated separately. While there are some experiments studying the impact of typed-family partitions, both versions of the system used in that section seem to be combined with the brTPF interface and the new planning.

The manuscript was submitted on August 30th, 2022. It has been over a year since the publication of the authors' previous work [14], however, neither SPF nor WiseKG was part of the state-of-the-art systems considered in the experiments. How does the system proposed in this paper compare against these two systems? Why is it okay to exclude them from the experiments? Why wasn't brTPF used in the experiments (on page 6, it states that brTPF provides "a higher query performance than TPF")?

The abstract mentions a novel query workload benchmark, but the characteristics of that "benchmark" are not clearly described (more details about the missing information are provided later).

The introduction mentions that the paper "proposes and formalize, and extend smart-KG", but wasn't smart-KG proposed in a conference paper in 2020 ([15])?

What are the main differences between the "novel smart-KG server-side partition-aware query planner" proposed in this paper and the one proposed in [14]?

Is there any connection between contribution C1 and Section 2.1 in [14]? For example, Definition 2.1 looks a lot like Definition 1 in [14].

Was s(G, Q) in line 46 (page 7) (also in line 35, page 8) meant to be s*(G, Q, ?) introduced in Definition 2.3? Or does Partition-based LDF not follow the formalization introduced in Section 2? If it is not used, why is it introduced?

Why is it good to use TPF (or brTPF) to evaluate triple patterns with variables as predicates?

What are the limitations of the alternative proposed towards the end of section 3.2?

How are the admissible queries for Hash Partitioning defined?

Could the concrete implementation of partition-based LDF presented in Section 3.4 be defined without considering caching?

Gi and Fi in line 41 (page 11) seem to be unspecified (quite differently from Gi in Sections 3.2 and 3.3, where Gi is clearly defined)

The text in section 3.6.1 is not clear enough. What exactly is meant by "very common classes"? Many resources that are explicitly described as instances of the same OWL class? Does it refer to explicitly defined subclasses? Does it refer to resources described as instances of multiple classes connected to each other using rdfs:subClassOf? How are instances of the same OWL class that belong to different families handled? How are resources that are explicitly described as instances of multiple classes handled? Do multiple typed-family partitions exist for subjects that are instances of more than one class?

The definition of s(G, Q) in line 28 (page 12) is not clear enough. Would it always be an empty set, because none of the elements of G (defined in line 26, page 11) satisfies the conditions mentioned in this definition?

If typed family-partitioning is used, how does it change the admissible queries?

The description of smart-KG in Section 4 seems imprecise. Which intermediate results are shipped? In which direction?

Has the impact of reducing the set of predicates allowed in the families been studied? While transferring a smaller family partition intuitively means that there would be less data transfer, the positive or negative impact of this restriction should depend on the executed query. If reducing the set of predicates in the relevant family partition means that a larger family partition and a large number of intermediate results (obtained using a brTPF interface) have to be transferred, then the impact on performance seems to be negative.

The claims in Section 5.1 should be supported by relevant statistics. What is the frequency of families in real-world KGs? Revise the use of "frequent entity" and "infrequent entity". It is unclear why the number of triples where an entity occurs has an impact on the size of the families. Unless the size is connected to the number of predicates instead of the number of subjects, but then the connection with equation 12 is unclear.

The combination of family grouping and typed family partitions is unclear. When both are used, is rdf:type allowed in the set of restricted predicates, or does it have to be excluded? When both a merged family and a typed family could be used to process the query, are both transferred to the client? If not, which one is preferred?

The definitions between line 22 and line 30 (page 14) are not clear enough. According to line 27, G'_I includes the union of the partitions G'_i; union means triples with common and non-common predicates. According to line 29, G'_I includes only the triples with subjects that had family partitions exclusively comprised of common predicates. Then, there are two different definitions for the same term, and they do not seem consistent with the example and explanation that follow. The first one would transfer too many triples (and it seems to be something that this definition was going to prevent). The second one would transfer too few triples, likely leading to incomplete results.

If, differently from Gubichev and Neumann [27], not all the possible orders are considered during query planning, then how is the order of the triple patterns within a star shaped pattern determined?

It is unclear how the use of HashMaps can prevent revisiting the same intersections more than once (lines 21-22, page 15). Different pairs of sets could have the same intersection.
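To make the concern concrete, a tiny sketch (my own illustration, not the paper's code): keying the HashMap by the pair of input sets deduplicates pairs, not intersection results.

```python
# Three hypothetical predicate families; (f1, f2) and (f1, f3) are
# different pairs, yet they produce the same intersection.
f1 = frozenset({"name", "birthDate", "spouse"})
f2 = frozenset({"name", "birthDate", "party"})
f3 = frozenset({"name", "birthDate", "club"})

cache = {}  # HashMap keyed by the (unordered) pair of input families
for a, b in [(f1, f2), (f1, f3)]:
    key = frozenset({a, b})
    if key not in cache:
        cache[key] = a & b  # intersection computed once per *pair*

# Two cache entries, but only a single distinct intersection:
print(len(cache), len(set(cache.values())))  # 2 1
```

So a pair-keyed map avoids recomputing the same pair, but the same intersection set can still be produced (and stored) multiple times via different pairs.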

Revise the use of "infrequent predicate" (line 42, page 15). If it is infrequent, how can it occur in "many different families"?

Complement the examples in (i) (section 5.2) with concrete values as is done in (ii), e.g., how frequent is a predicate, how infrequently does a predicate occur in the analyzed query logs? Consider whether these values are more informative as absolute values, percentages, or something else.

Some choices in Section 5 seem to target the efficient processing of queries that focus on the most structured part of the KG, but how do these choices impact the processing of queries that need less structured parts of the KG? Is their performance negatively impacted?

Equation 14 and the following explanation are not clear enough. It is unclear what "line 2" and "line 3" refer to. The connection between footnote 16 and equation 14 and (ii) is not clear. Equation 14 seems to give an UPPER bound for the size of the partition, then the connection to issue (iii) is unclear. Are core families merged with all the non-core families that they overlap with?

Definitions in lines 16-21 (page 17) are not clear enough. It would be great if precise definitions could be provided. Why is it okay to use a minimal subset? Doesn't the use of a minimal subset lead to incomplete query results?

Consider reducing the use of footnotes. For example, try to combine relevant text currently provided in footnotes with the regular text.

Consider introducing Q' and Q'' before using them in plans (example in lines 44-48, page 18)

Text in lines 40-43 seems unnecessarily unclear. If brTPF is used, why use the TPF selector function? According to Table 1, they differ in the admissible values of omega. How is it based on brTPF requests? One TPF LDF API request corresponds to how many brTPF requests? Which ones?

Lines 44-45 (page 19): only a graph, a set of family partitions, or a set of typed-family partitions. What exactly is returned? A set of triples? A set of sets of triples? How is it determined whether it is a set of family partitions or a set of typed-family partitions?

Revise text in lines 24-28 (page 20), line 5 doesn't seem to represent a left subtree.

What is the difference between \tau_l, \tau_h and \tau_{p_{low}} and \tau_{p_{high}}?

There is a large number of parameter values described in Section 7.1.1; consider using a table to provide an overview. Some values were not explicitly provided (e.g., \alpha_t) and some values were only provided for some of the datasets (e.g., only for DBpedia).

If the authors were interested in showing how well their system worked for real-world KGs with different characteristics, what prevents them from using query workloads such as the ones that can be generated using LSQ [60] and Feasible [61]? (for a revised version, they could also consider

The query workload descriptions are not clear enough. For watdiv-btt, it is mentioned that a set of 20 queries was generated for each client (line 37, page 22) and 2432 is the number of queries according to Table 4. Does it mean that there are 2432/20 (121.6) clients? How were these queries allocated to clients? Are there clients that execute the same query? For watdiv-sts there are 19968 queries according to Table 4, and each client workload has 154 non-overlapping queries; how were those workloads created? Were there 19968/154 (129.66) clients? How representative is one set of 16 random queries out of 1753067 queries? What are the characteristics of those 16 random queries?

How were the queries for the watdiv-btf, watdiv-stf, and DBpedia-btt workloads selected? What are their characteristics? How many queries were included in watdiv-stf?

Did the evaluation of the proposed typed family partitioning exclude queries that do not include triple patterns with the rdf:type predicate? Did it include queries without star patterns?

How is it possible that 128 clients each had 156GB of RAM on a machine with 2TB of RAM? It seems there was not enough memory to ensure that each client could execute its queries without affecting the resources of other clients.

When LSQ is mentioned in Section 7.1.3 (e.g., line 45, page 22; table 4), does it refer to DBpedia queries?

Given the values of \tau_{c_{low}} and \tau_{c_{high}} in line 43 (page 24), the experiments seem to pick a very specific set of classes. How were these values decided? Do these parameters need to have the same value? How different would the results be for different values of these parameters?

Could the explanations given in lines 25-31 (page 27) be more precise? What is the contribution of each reason?

HTTP caching was used for TPF requests (lines 37-38, page 27); would a similar caching of popular families improve smart-KG's performance?

Explanations in lines 40-43 (page 28): what is the contribution of each reason?

Why does the last part of section 7.5 focus on smart-KG, stating that "smart-KG potentially needs a higher client RAM than the compared systems", when figure 8a seems to show that smart-KG2020 transfers more data and potentially uses more client RAM?

Which of the client specifications were used for Section 7.6? Were these experiments performed using WatDiv1B? If yes, what were the results? If no, why not?

How were the queries shown in Figure 9 selected? Why use different sets of queries for "unbounded" and "bounded"?

Line 39 (page 30) states "typed-family partitioning has no influence on the queries of the watdiv-btf_{unbounded}", but in Figure 10, we see that the results are different, in particular for 10M and S6. If that is not related to typed-family partitioning, then what other differences are there between the two systems that are compared?

The description of DBpedia-btt_{bounded} given in section 7.1.3 says that it includes 8 queries, but figure 11 includes 19 queries. How can both be true? Do queries in Q1-Q19 include the same query more than once?

The results for queries Q1-Q19 presented in figure 11 could be complemented with information about the data transfer and number of requests of each of these queries.

Explanations in Section 7.6.2 suggest that intermediate results have a huge impact on performance, but at the same time it is not so clear how representative the queries in that experiment are in terms of diverse combinations of intermediate results. Would the results be very different for another set of queries?

Why is it okay to include the following restrictions in the proof of proposition 1: Q is composed of a pair of triple patterns, variable-free predicate p (do both triple patterns have the same predicate?), p is included in P'_G?

Why does the proof of proposition 1 refer to equation 14? wouldn't \mu_G(f) be used?

The authors do not seem to meet the requirement to provide a URL to a file with resources. A link to a repository was provided, but it was not possible to access the content without requesting an account.

Possible typos / minor comments:
- The end of line 42 on page 2
- In line 44 (page 3), should it be "subj(t), pred(t), obj(t)"?
- n, m in lines 43, 44 (page 7) were meant to be the same variable?
- is the set inclusion in line 23 (page 8) an equality?
- what is s(Q) in line 25 (page 8)
- triple patterns queries (line 8, page 9)
- (?s, p, ?o) was meant to be (?s, ?p, ?o)? (line 39, page 9) (line 23, page 10)
- "bound rdf:type predicates" -> does bound really refer to the predicate? or to the object? (line 8, page 12)
- "decomposes the input query into a set of o star-shaped subqueries" (lines 18-19, page 13)
- m, m' in lines 37-38 (page 13) were meant to be the same variable?
- the line numbers used when describing the algorithms, e.g., lines 4, 5 seems to be incorrectly referenced when describing Algorithm 1. For Algorithm 2, lines 8, 9, 11, 13, 14, 15, 17, 18 seem to be incorrectly referenced. For Algorithm 3, line 6 seems incorrectly referenced
- Revise lines 2, 3 of Algorithm 1. Are f and F'_i the same?
- "all families in G could prohibitive" (line 35, page 15)
- (iiii) --> (iv) (line 33, page 16)
- "the smart-KG materializes" (line 21, page 17)
- "how smart-KG server" (lines 31-32, page 17)
- Are the two Q's in different font styles in line 37 (page 17) the same?
- The statement in footnote 18 is not true (Algorithm 1 as presented in page 15 doesn't do that)
- "Plan(Q) is shown in Alg. 2" (line 14, page 18) but Algorithm 2 describes Plan(Q, P'_G, C'_G)
- "characterizes an potentially" (line 21, page 18)
- line 19 in Algorithm 2 seems to be misaligned
- "(i.e." is not closed (line 30, page 20)
- "the same thresholds use in" (line 12, page 22)
- "a real-world dataset such as DBpedia" (lines 27-28, page 22) --> is it DBpedia? or another one?
- "the percentage of timeouts drastically increases for TPF... from 44% in 1-client workload to 56% with 128 clients" (lines 32-33, page 26) --> it doesn't seem drastic
- The paper suggests two version of smart-KG, so it seems unnecessary to have more than two names for them (e.g., smart-KG, smart-KG2020, smartKG 2020, smartKG, smartKG 1.0)
- "with a better geometic in the" (line 32, page 28)
- "to consider the two execution shipping partition and brTPF based on cardinality estimations" (line 23, page 32)
- "First, the optimizer devises an ordering of the star-shaped queries Qi and the triple patterns within Qi (Alg. 2, lines 3–7)" --> but Algorithm 2 does not describe how is the ordering of triple patterns within Qi

Review #2
Anonymous submitted on 15/Apr/2023
Minor Revision
Review Comment:

The paper presents an extension of LDF, called smart-KG, which essentially aims to achieve better load-balancing between servers and clients for SPARQL queries executed against a federated architecture of RDF resources (so-called “SPARQL endpoints”). The journal submission is largely based on a previous WWW 2020 paper (and to some extent also another WWW 2021 paper) by the same/similar authors.

The presented journal version extends the original idea of family partitioning with a more generalized form of “typed” family partitioning, which also considers the entities’ types for the partitions generated based on precomputed star query patterns. A very detailed experimental evaluation is provided, comparing the old and new partitioning schemes as well as a set of common baselines in the semantic web domain.

The paper itself is very well presented and provides a clean mathematical formalization of the framework. Overall, I see sufficient novelty in the contributions to accept the paper for the Semantic Web Journal and will restrict myself to the following comments for a (minor) revision:

- broken line on Page 2: and found that … 90% of the queries
- the notation for s(G,Q) seems overloaded: first it is defined with two or three arguments, but later it is used with only one argument, as s(Q), where the specific partitioning G for the star-queries then seems lost
- the shipping of “entire HDT compressed family partitions” to the clients in principle seems like a questionable approach to me, which obviously involves a lot of caching tradeoffs by the clients and probably only makes sense when the clients can reuse the partitions across multiple recurring query patterns; these tradeoffs (e.g. also the plain byte-sizes) of the shipped partitions should be discussed and compared in more detail
- also the initial overhead of shipping the partitions to the clients, as well as the precomputation of the partitions at the server, should be reflected in more detail
- the distributed processing of RDF/SPARQL was also a hot topic in databases about 10-15 years ago; perhaps a more detailed comparison against non-federated architectures such as TriAD, TLDR, RDF-3x, etc. (at least as baselines) could be considered in addition (at least for the kinds of star-queries considered here)

Review #3
By Vojtěch Svátek submitted on 27/May/2023
Major Revision
Review Comment:

The paper describes an implemented multi-part method for KG querying, primarily relying on serving a novel kind of graph partitions from the server side.
The topic is highly relevant for SWJ, and the extent of original theoretical and empirical results presented in the paper appears quite solid. The contribution to the progress in RDF KG querying, conveyed through the text, is significant in my opinion. (However, I am not deeply knowledgeable in this field, so I cannot fully judge these aspects.) My critical comments below mainly regard the writing quality, which is definitely not low, but should be improved along several lines.
Unfortunately, I wasn’t able to access the data files at the provided link. It required me to provide credentials for GitLab Enterprise Edition, which I do not have (the normal GitHub password did not work). Maybe the issue was just the lack of some technical skill on my side – but I would normally expect the repo to be public without any reading restrictions.
As regards the impressive amount of presented results, there are possibly even too many technical details, which may overwhelm the reader, while I miss a concise presentation of the main findings and lessons learned. There is no Discussion section following the experimental sections. There are many digressions popping up throughout the paper, which seem mostly relevant, but are presented in such a dense manner that they do not clarify the overall picture (one example for many: reference to the authors’ own work re WiseKG – dynamic load delegation – in the last para of the Conclusions).
What I find impractical is the way how the formal description of the partitioning methods is interleaved with the description of technical details of the particular implementation. In Section 4 the authors start to describe their tool, then follows Section 5 on “smart-KG Partition Generator” (where one would thus expect some engineering content), and then suddenly the new formal notion of predicate-restricted family is being defined. I would prefer to have the whole formal apparatus in one place (starting from pre-existing models, then moving to the new contribution presented in this paper – which are actually sometimes also a bit hard to tell apart), and only then following up with the details of the smart-KG framework and with the experiments carried out through it. Perhaps the authors are ambivalent between presenting smart-KG as a general method or a software framework? On the one hand, they label it with a scientifically meaningless “marketing” brand, but on the other hand, they speak about “proposing” and “formalizing” and “extending” it (p. 2, l. 45), which indicates that it should be treated as a complex variant of an abstract partitioning method, thus deserving a rather technical calling (perhaps foldable into some acronym, analogously to e.g. “brTPF”).
An aspect of the paper that may be quite valuable if more elaborated is the empirical analysis of KG partitioning. When reading through Sections 2 and 3, I was actually looking for empirical material giving me a better idea of how both the number and size of partitions differ for real-world KGs when switching between different partitioning approaches. However, Table 3 only shows the figures for a single real-world KG, DBpedia. There is also an analogous table, Table 8, in the appendix, for four other KGs. Yet, the amount of information per KG is rather limited: only the authors’ partitioning approach, and only the number of sets, but not their (average, min+max, median…) size. The rest of the empirical findings in the paper is devoted to performance comparisons, which is of course of high practical importance, but only reveals the final effects and not the possible causes behind them, which are likely to be largely correlated with the partitions’ quantitative features – and, of course, with the nature and provenance of the KG, which impacts the patterns present in it (see also my minor comment below).
Since the paper text is rich in formulas, I would strongly recommend numbering them and using the numbers for cross-referencing.

Detailed, mostly minor comments:
- As regards the SPARQLES-based results on low SPARQL endpoint uptimes (p. 2): while the performance downsides of server-only computation surely play a role for some of them, I expect that possibly an even larger share of this effect may be caused by poor maintenance of those endpoints, which had often been set up for the purpose of a research project and have never enjoyed extensive traffic. BTW, the SPARQLES statistics are from August 2022, so they could perhaps be updated in the next version of the paper.
- Either the term IRI (preferably) or URI should be consistently used in the paper, but not an arbitrary mix of them.
- At the beginning of p. 4, familiarity with the notion (and associated “butterfly” notation) of query execution plan is taken for granted on the reader’s side (while basic notions related to KGs that are probably more widely known to the SWJ audience are properly explained in other places). I suggest lowering the expectations in this respect.
- The para starting with “The selector function s has as parameters an RDF graph G, a SPARQL pattern Q, and a set of bindings…” should be integrated into the preceding formal Def 2.1.
- I do not understand what the double vertical bar (“parallels” of two substitutions) means in the end of Def 2.3.
- I am a bit confused by the notions of limit and offset as in the description of TPF in Table 1: “chunks of n triples, whereas limit l = 1 as it is possible to only retrieve one page at a time, and offset o is the page requested by the client”. I am not deeply familiar with TPF, but would expect that the limit of 1 would mean one triple rather than one page at a time? Similarly, the note on the offset looks strange to me.
- In Table 1, too (just a minor detail), the text on the binding set (\Omega) should probably have the same wording for brTPF and SPF?
- In the first para of 2.3.2 I would just like to check why there are two sets of Gs, one with cardinality n and one with cardinality m. Is it the case that the first is the temporary set of non-disjoint partitions, while the latter is the set of disjoint partitions that is eventually returned?
- P. 8: “the union of these partitions does contain all, but not too many unnecessary triples for computing the actual query” Perhaps, “all necessary, but not too many unnecessary…”?
- P. 9 “if we assume partitions per n object ranges (e.g. from a histogram)” – I do not understand the meaning of “histogram” here; for me it is just a bar chart, but this probably does not make sense here.
- P. 9 “where subgraphs containing all/exactly the results of particularly common sub-queries” Does “all” mean “superset of” here? Or how does it differ from “exactly”?
- P. 11: “The assumption here is that in real-world RDF graphs, subjects with similar characteristics are described in the same fashion forming such common star-structures, i.e. many subjects of the same type share the same combinations of predicates.” Presented this way, the assumption seems a bit too strong. It should perhaps be made clear as early as here that the abundance of these patterns strongly depends on the source/nature of the KG. While KGs refactored from tabular datasets will often look this way, crowdsourced sources (such as DBpedia, actually mentioned in the next example) may contain many missing values. Some remarks later in the paper actually seem to confirm this intuition.
- P.11: “Inspired by the ideas of characteristic sets, we define the predicate family, …” Actually, the difference between characteristic sets and predicate family should rather be explained formally. “Inspired by” looks too vague here.
- P. 11, l.26: the prime in p’, in the def. of F(s), is probably not needed?
- P. 24: Just for typographic concerns, I would discourage from filling a whole subsection (7.1.5) just with a bulleted list, without any introductory text.
- P. 32: “A recent study [58]…” I might not label a 2014 paper as “recent”, given the fast pace in this research subfield.
Typos and grammar:
- P. 2: “bounded objects”, “bounded type predicate” – should probably be “bound”? similarly p. 8 “works well for triple pattern queries with bounded predicates, whereas for any triple patterns with unbound predicates…”
- Occasional grammatical number mismatches, such as “smart-KG… use 5 times less network traffic” (p. 3)
- P. 4 “we will herein particularly rely on HDT [24], a well-known compressed format for RDF graphs, permits” … missing “which” or similar
- P. 5 “while processing more complex patterns to the client” … perhaps “leaving the processing…”?
- P. 5 “However, In general” (capitalized “I”)
- P. 9 “you only need to ship partitions for predicates mentioned in the query, that also occur in G” – superfluous comma
- P. 9 “is that it only supported single triple queries, any joins or more complex patterns would need” – probably “…, while any joins…”
- P. 11 “are different than those describe Persons” – verb form
- P. 12, l. 32 “we will also exploit the fact, that predicate families or characteristic sets, have been used successfully” Both commas are superfluous
- P. 13, l.14 “practical partition generator)for a given knowledge graph” Missing blank
- P. 13, l.21 “he client then, evaluates” - superfluous comma
- P. 13, l. 37 “Analogously, we denote as F′(G) = …, to the set of different predicate-restricted families” Probably, superfluous “to”
- P. 15, l. 25 “could prohibitive” Missing verb
- P. 32, l. 35 “Note that, the typed-family partitions” - superfluous comma.
- Many bibitems are BibTeX-decapitalized. Personally, I would also prefer the publication year to appear in the conference acronym rather than just next to the authors’ list. Formally, nothing like “Proc. Of WWW” exists; there are always just the proceedings of a concrete edition. The same might perhaps even apply to journals.
Overall, the paper is well-focused and its content potentially quite valuable to the community. However, the readability and overall narrative should be improved as suggested before its final acceptance. Also, the issue with the data accessibility should be sorted out (unless it is just some blindness on my side).

Review #4
By Stasinos Konstantopoulos submitted on 11/Jun/2023
Major Revision
Review Comment:

The submission presents a method for SPARQL query processing that joins the trend for light-weight servers and explores a new point on the trade-off between overloading the servers and transporting too much data to the client. This is an interesting method and a reasonable path to explore in the TPF/SaGe line of research.

The "Long-term stable URL for resources" is not accessible, so it could not be assessed. The URL points to an institutional git server, which is not the best option for FAIR publishing.

The quality of the text is not really good, with lots of places where the text is unclear, not very helpful for the reader, or not very carefully thought out. See the detailed comments below.

The method is not expected to have significant impact, as there are several issues left open. The most obvious one, in my opinion, is that it seems to be very sensitive to careful parameterization. Sect 5.2 offers some guidelines and intuitions, which are useful but not at the level of concreteness that would allow publishers to configure their servers. Having said that, the submission presents experiments that are worth sharing with the community, and can be a useful stepping stone towards a more complete solution.

Detailed comments:

Sect 3.2, definition of \cal{G}:

The way this is formulated implies that objects are from a concrete domain with full ordering. This is a restriction that is not in the text, so there is a mismatch between the text and the formal definition. In any case, the restriction is not necessary, any bucketing method will serve your needs here. So the maths should be corrected to use a bucketing scheme, be it numerical thresholds or otherwise.
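To illustrate the point, here is a minimal Python sketch (function names and thresholds are hypothetical, not from the paper) showing that any bucketing function over objects serves the definition, whether it uses numeric thresholds (which require an ordered domain) or hashing (which does not):

```python
# Hypothetical sketch: two interchangeable bucketing schemes for objects.
# Only the threshold-based one requires an ordering on the object domain.

def threshold_bucket(value, thresholds=(10, 100, 1000)):
    """Bucket a numeric object by fixed thresholds (needs full ordering)."""
    for i, t in enumerate(thresholds):
        if value < t:
            return i
    return len(thresholds)

def hash_bucket(term, n_buckets=4):
    """Bucket an arbitrary RDF term by hashing (needs no ordering)."""
    return hash(term) % n_buckets

# Both yield a bucket index usable in the partition definition:
print(threshold_bucket(42))    # -> 1
print(hash_bucket("ex:Alice")) # some index in 0..3
```

Either function could be plugged into the formal definition in place of the ordering-based formulation.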

Sect 3.3, definition of \cal{G}:

Why the two-level hashing? ?s is first hashed and then the modulo of the hash is used. Surely, a modulo in the hash function ensures that hash values are in the range that is appropriate for the application (here, the number of partitions) but at the level of the mathematical formulation this is not needed. Also, strictly speaking, it invalidates the comment in the text that "hash collisions may cause skewed partition sizes", since it is not hash collisions but modulo collisions that skew the distribution among partitions.

Editorial: I think it would read better if the order of Sect 2.2 and 2.3 was swapped.

Sect 3.4: "We leave a concrete formalization/implementation of partition-based LDF following this idea to future work." I do not understand why this, in particular, is deferred to future work. What complications have you encountered that push this outside the scope of this article? In any case, the promise to discuss this as future work is not met in the "Future Work" section.

Sect 3.5: "edges linking border vertices in different partitions"
Why are you qualifying these vertices as "border vertices"? What is the definition of "border vertices"? Just saying "edges linking vertices from different partitions" would have been clear, now I am wondering if the "border vertices" are a subset of the vertices with inter-partition edges.

Sect 3.6, definition of F(G): The usual convention is to use \cal{F} for a set of F.

Also in the same section, you should explain family partitions in the text, in as plain language as possible, to help the reader follow your ideas. My understanding is that what you mean is "All triples such that their subjects have the same predicate family." A simple example would also help. Here's what I noted down trying to work out what the definition means:

For the KB:
w p a . w r b .
u p c . u r d .
v p e .
We get:
F(w) = F(u) = {p,r}
F(v) = {p}
Which gives two G's:
G_{w,u} = { <w p a>, <w r b>, <u p c>, <u r d> }
G_v = { <v p e> }

Section 3.6.1: It makes little sense for Subsection 3.6 to contain a single subsubsection. Maybe you can consider promoting 3.6.1 to 3.7?

Also, in the same section, using the disjointness of predicate URIs and class URIs is a hack. You note that you use this hack to simplify notation, but it doesn't simplify it, in my opinion. You can express in the mathematical formulation exactly what you say in the text, that you consider pairs of F(s) and classes c such that:

{ <F(s), c> | <s rdf:type c> in G }

Similarly to above, you should explain type-based partitions and give an example, ideally elaborating on the example from Sect 3.6.

On p. 13, lines 23-25: The way this is formulated, it sounds like a minor practical consideration, which is a strange way to end a first-level section. Maybe the final paragraph could let us know how the article proceeds in Sect 5 and 6, which elaborate on the architectural overview given in Sect 4.

Staying on the same passage, this consideration is not as minor as the text makes it appear: the decision to use HDT alleviates the storage volume caused by replicated triples, but it has an impact on the ability to serve paginated results. I would recommend discussing this trade-off, maybe in the context of future work exploring other approaches to saving disk space, or on-the-fly decompression to serve clients that require pagination.

Section 5, p. 13, line 38, "disjoint partitions": Upon first reading, I noted that "disjoint" is redundant because partitions are always disjoint by definition. Further down the text, I realized that what you call "partitions" are not necessarily disjoint. This is an abuse of terminology, in my opinion, and must be corrected throughout the text. If I misunderstood, and these really are partitions, please improve the text so that readers do not misunderstand it.

Sect 5.1, p. 14, line 29: This should be the other way around: F(s) is a superset of F(I), since F(I) is the intersection of many F's.

Same section, line 32: "we only transfer G'_{1,2} rather than G'_1 union G'_2"
As I understood the above, G'_{1,2} is larger than the union of G'_1 and G'_2, since it also includes all families that happen to contain the predicates in F_I. So this comment does not make sense. Please explain in much more detail in the text, with an example.

Minor editorial:

p. 2, at multiple places: low-expressive -> low-expressivity

p. 2, line 42, remove newline.

p. 2, footnote 1, "in act" -> "in fact"

p. 10, line 45: ``caching'' (using the latex quotes) instead of
"caching" (ASCII quotes) looks a lot better.

p. 13, line 19: "a set of o star-shaped subqueries"