Review Comment:
The paper presents an extension of smart-KG, initially proposed in a conference paper, as a novel approach to sharing the query processing load between clients and the server.
smart-KG generates plans that combine the shipping of precomputed and compressed partitions (to the client) with the evaluation of single triple patterns by the server. The client uses the partitions to process star-shaped subqueries and evaluates joins (and other operators) to produce the query result.
The extension presented in this paper comprises:
- server side planning that exploits cardinality estimations to produce a better plan
- precomputation of typed-family partitions
- use of the brTPF interface to reduce the data transfer and number of requests when evaluating single triple patterns (TPF was used before)
It would have been great if the individual contribution of each of these three changes had been evaluated separately. While there are some experiments studying the impact of typed-family partitions, both versions of the system used in that section seem to include the brTPF interface and the new planner.
The manuscript was submitted on August 30th, 2022. It has been over a year since the publication of the authors' previous work [14], yet neither SPF nor wiseKG was part of the state-of-the-art systems considered in the experiments. How does the system proposed in this paper compare against these two systems? Why is it okay to exclude them from the experiments? Why wasn't brTPF used in the experiments (on page 6, it is stated that brTPF provides "a higher query performance than TPF")?
The abstract mentions a novel query workload benchmark, but the characteristics of that "benchmark" are not clearly described (more details about the missing information are provided later).
The introduction mentions that the paper "proposes and formalize, and extend smart-KG", but wasn't smart-KG already proposed in a conference paper in 2020 ([15])?
What are the main differences between the "novel smart-KG server-side partition-aware query planner" proposed in this paper and the one proposed in [14]?
Is there any connection between contribution C1 and Section 2.1 in [14]? For example, Definition 2.1 looks a lot like Definition 1 in [14].
Was s(G, Q) in line 46 (page 7) (also in line 35, page 8) meant to be s*(G, Q, ?) as introduced in Definition 2.3? Or does Partition-based LDF not follow the formalization introduced in Section 2? If that formalization is not used, why is it introduced?
Why is it good to use TPF (or brTPF) to evaluate triple patterns with variables as predicates?
What are the limitations of the alternative proposed towards the end of section 3.2?
How are the admissible queries for Hash Partitioning defined?
Could the concrete implementation of partition-based LDF presented in Section 3.4 be defined without considering caching?
Gi and Fi in line 41 (page 11) seem to be unspecified (quite differently from Gi in Sections 3.2 and 3.3, where Gi is clearly defined).
The text in Section 3.6.1 is not clear enough. What exactly is meant by "very common classes"? Many resources that are explicitly described as instances of the same OWL class? Does it refer to explicitly defined subclasses? Does it refer to resources described as instances of multiple classes connected to each other using rdfs:subClassOf? How are instances of the same OWL class that belong to different families handled? How are resources that are explicitly described as instances of multiple classes handled? Do multiple typed-family partitions exist for subjects that are instances of more than one class?
The definition of s(G, Q) in line 28 (page 12) is not clear enough. Would it always be an empty set, since none of the elements of G (defined in line 26, page 11) satisfies the conditions mentioned in this definition?
If typed family-partitioning is used, how does it change the admissible queries?
The description of smart-KG in Section 4 seems imprecise. Which intermediate results are shipped? In which direction?
Has the impact of reducing the set of predicates allowed in the families been studied? While transferring a smaller family partition intuitively means less data transfer, the positive or negative impact of this restriction should depend on the executed query. If reducing the set of predicates in the relevant family partition means that a larger family partition and a large number of intermediate results (obtained using a brTPF interface) have to be transferred, then the impact on performance seems to be negative.
The claims in Section 5.1 should be supported by relevant statistics. What is the frequency of families in real-world KGs? Revise the use of "frequent entity" and "infrequent entity". It is unclear why the number of triples where an entity occurs has an impact on the size of the families, unless the size is connected to the number of predicates instead of the number of subjects, but then the connection with Equation 12 is unclear.
The combination of family grouping and typed-family partitions is unclear. When both are used, is rdf:type allowed in the set of restricted predicates? Does it have to be excluded? When both a merged family and a typed family could be used to process the query, are both transferred to the client? If not, which one is preferred?
The definitions between line 22 and line 30 (page 14) are not clear enough. According to line 27, G'_I includes the union of the partitions G'_i, where union means triples with common and non-common predicates. According to line 29, G'_I includes only the triples with subjects whose family partitions are exclusively comprised of common predicates. Then, there are two different definitions for the same term, and neither seems consistent with the example and explanation that follow. The first would transfer too many triples (which seems to be exactly what this definition was intended to prevent). The second would transfer too few triples, likely leading to incomplete results.
If, differently from Gubichev and Neumann [27], not all the possible orders are considered during query planning, then how is the order of the triple patterns within a star shaped pattern determined?
It is unclear how the use of HashMaps can prevent revisiting the same intersection more than once (lines 21-22, page 15). Different pairs of sets could have the same intersection, e.g., {a, b} ∩ {a, c} = {a, d} ∩ {a, e} = {a}.
Revise the use of "infrequent predicate" (line 42, page 15): if it is infrequent, how can it occur in "many different families"?
Complement the examples in (i) (Section 5.2) with concrete values, as is done in (ii): e.g., how frequent is a predicate? How infrequently does a predicate occur in the analyzed query logs? Consider whether these values are more informative as absolute values, as percentages, or in some other form.
Some choices in Section 5 seem to target the efficient processing of queries that focus on the most structured part of the KG, but how do these choices impact the processing of queries that need less structured parts of the KG? Is their performance negatively impacted?
Equation 14 and the following explanation are not clear enough. It is unclear what "line 2" and "line 3" refer to. The connection between footnote 16, Equation 14, and (ii) is not clear. Equation 14 seems to give an UPPER bound for the size of the partition, so the connection to issue (iii) is unclear. Are core families merged with all the non-core families that they overlap with?
Definitions in lines 16-21 (page 17) are not clear enough. It would be great if precise definitions could be provided. Why is it okay to use a minimal subset? Doesn't the use of a minimal subset lead to incomplete query results?
Consider reducing the use of footnotes. For example, try to combine relevant text currently provided in footnotes with the regular text.
Consider introducing Q' and Q'' before using them in plans (example in lines 44-48, page 18)
The text in lines 40-43 seems unnecessarily unclear. If brTPF is used, why use the TPF selector function? According to Table 1, they differ in the admissible values of omega. How is it based on brTPF requests? One TPF LDF API request corresponds to how many brTPF requests? Which ones?
Lines 44-45 (page 19) mention only a graph, a set of family partitions, and a set of typed-family partitions. What exactly is returned? A set of triples? A set of sets of triples? How is it determined whether it is a set of family partitions or a set of typed-family partitions?
Revise the text in lines 24-28 (page 20); line 5 does not seem to represent a left subtree.
What is the difference between \tau_l, \tau_h and \tau_{p_{low}}, \tau_{p_{high}}?
There is a large number of parameter values described in Section 7.1.1; consider using a table to provide an overview. Some values are not explicitly provided (e.g., \alpha_t), and some are only provided for some of the datasets (e.g., only for DBpedia).
If the authors were interested in showing how well their system worked for real-world KGs with different characteristics, what prevents them from using query workloads such as the ones that can be generated using LSQ [60] and Feasible [61]? (for a revised version, they could also consider https://www.semantic-web-journal.net/content/lsq-20-linked-dataset-sparq...)
The query workload descriptions are not clear enough. For watdiv-btt, it is mentioned that a set of 20 queries was generated for each client (line 37, page 22), while 2432 is the number of queries according to Table 4. Does this mean that there are 2432/20 (121.6) clients? How were these queries allocated to clients? Are there clients that execute the same query? For watdiv-sts, there are 19968 queries according to Table 4, and each client workload has 154 non-overlapping queries; how were those workloads created? Were there 19968/154 (129.66) clients? How representative is one set of 16 random queries out of 1753067 queries? What are the characteristics of those 16 random queries?
How were the queries for the watdiv-btf, watdiv-stf, and DBpedia-btt workloads selected? What are their characteristics? How many queries were included in watdiv-stf?
Did the evaluation of the proposed typed-family partitioning exclude queries that do not include triple patterns with the rdf:type predicate? Did it include queries without star patterns?
How is it possible that 128 clients had 156GB of RAM on a machine with 2TB of RAM? It seems that there was not enough memory to ensure that each client could execute its queries without affecting the resources of the other clients.
When LSQ is mentioned in Section 7.1.3 (e.g., line 45, page 22; table 4), does it refer to DBpedia queries?
Given the values of \tau_{c_{low}} and \tau_{c_{high}} in line 43 (page 24), the experiments seem to pick a very specific set of classes. How were these values decided? Do these parameters need to have the same value? How different would the results be for different values of these parameters?
Could the explanations given in lines 25-31 (page 27) be more precise? What is the contribution of each reason?
HTTP caching was used for TPF requests (lines 37-38, page 27); would a similar caching of popular families improve smart-KG's performance?
Explanations in lines 40-43 (page 28): what is the contribution of each reason?
Why does the last part of Section 7.5 focus on smart-KG, stating that "smart-KG potentially needs a higher client RAM than the compared systems", when Figure 8a seems to show that smart-KG2020 transfers more data and potentially uses more client RAM?
Which of the client specifications were used for Section 7.6? Were these experiments performed using WatDiv1B? If so, what were the results? If not, why not?
How were the queries shown in Figure 9 selected? Why use different sets of queries for "unbounded" and "bounded"?
Line 39 (page 30) states that "typed-family partitioning has no influence on the queries of the watdiv-btf_{unbounded}", but in Figure 10 we see that the results differ, in particular for 10M and S6. If that is not related to typed-family partitioning, what other differences are there between the two systems being compared?
The description of DBpedia-btt_{bounded} given in section 7.1.3 says that it includes 8 queries, but figure 11 includes 19 queries. How can both be true? Do queries in Q1-Q19 include the same query more than once?
The results for queries Q1-Q19 presented in figure 11 could be complemented with information about the data transfer and number of requests of each of these queries.
Explanations in Section 7.6.2 suggest that intermediate results have a huge impact on performance, but at the same time it is not so clear how representative the queries in that experiment are in terms of diverse combinations of intermediate results. Would the results be very different for another set of queries?
Why is it okay to include the following restrictions in the proof of proposition 1: Q is composed of a pair of triple patterns, variable-free predicate p (do both triple patterns have the same predicate?), p is included in P'_G?
Why does the proof of Proposition 1 refer to Equation 14? Wouldn't \mu_G(f) be used?
The authors do not seem to meet the requirement to provide a URL to a file with resources. A link to a repository was provided, but it was not possible to access the content without requesting an account.
Possible typos / minor comments:
- The end of line 42 on page 2
- In line 44 (page 3), should it be "subj(t), pred(t), obj(t)"?
- Were n, m in lines 43, 44 (page 7) meant to be the same variable?
- Is the set inclusion in line 23 (page 8) an equality?
- What is s(Q) in line 25 (page 8)?
- "triple patterns queries" (line 8, page 9)
- Was (?s, p, ?o) meant to be (?s, ?p, ?o)? (line 39, page 9; line 23, page 10)
- "bound rdf:type predicates" -> does "bound" really refer to the predicate or to the object? (line 8, page 12)
- "decomposes the input query into a set of o star-shaped subqueries" (lines 18-19, page 13)
- Were m, m' in lines 37-38 (page 13) meant to be the same variable?
- the line numbers used when describing the algorithms, e.g., lines 4, 5 seems to be incorrectly referenced when describing Algorithm 1. For Algorithm 2, lines 8, 9, 11, 13, 14, 15, 17, 18 seem to be incorrectly referenced. For Algorithm 3, line 6 seems incorrectly referenced
- Revise lines 2, 3 of Algorithm 1. Are f and F'_i the same?
- "all families in G could prohibitive" (line 35, page 15)
- (iiii) --> (iv) (line 33, page 16)
- "the smart-KG materializes" (line 21, page 17)
- "how smart-KG server" (lines 31-32, page 17)
- Are the two Q's in different font styles in line 37 (page 17) the same?
- The statement in footnote 18 is not true (Algorithm 1 as presented in page 15 doesn't do that)
- "Plan(Q) is shown in Alg. 2" (line 14, page 18) but Algorithm 2 describes Plan(Q, P'_G, C'_G)
- "characterizes an potentially" (line 21, page 18)
- line 19 in Algorithm 2 seems to be misaligned
- "(i.e." is not closed (line 30, page 20)
- "the same thresholds use in" (line 12, page 22)
- "a real-world dataset such as DBpedia" (lines 27-28, page 22) --> is it DBpedia? or another one?
- "the percentage of timeouts drastically increases for TPF... from 44% in 1-client workload to 56% with 128 clients" (lines 32-33, page 26) --> it doesn't seem drastic
- The paper presents two versions of smart-KG, so it seems unnecessary to have more than two names for them (e.g., smart-KG, smart-KG2020, smartKG 2020, smartKG, smartKG 1.0)
- "with a better geometic in the" (line 32, page 28)
- "to consider the two execution shipping partition and brTPF based on cardinality estimations" (line 23, page 32)
- "First, the optimizer devises an ordering of the star-shaped queries Qi and the triple patterns within Qi (Alg. 2, lines 3–7)" --> but Algorithm 2 does not describe how the triple patterns within Qi are ordered