Review Comment:
In the revised paper the authors significantly improved the level of detail and the clarity of the presentation of the results, fixed errors, typos, and some other flaws, and removed redundant and implementation-specific information.
Among other things, they improved the model pictures/examples and included a listing giving an overview of the queries for all models w.r.t. the GET operation, both of which help readers not familiar with the models to understand and compare them on a structural level. The plots were significantly improved, resulting in much better readability, and their format is now optimized for comparing the actual models for the same list. Furthermore, the authors added more detailed descriptions of the figures, making them more self-explanatory and summarizing the observed behaviour of each operation. The Likert categories are now defined by clear thresholds that are highlighted in the plots.
Moreover, they also extended the experiment with lower-half index queries and compared them to the higher-half queries.
Unfortunately, some issues raised in my previous review with regard to the design of the experiment (especially the queries) remain, or we have misunderstood each other on them.
The benchmark queries now presented in Fig. 6 still do not seem to be formulated in a pragmatic way and are somewhat detached from reality. I think a person familiar with SPARQL and some basic quad-store internals would write the queries differently, and probably far more naturally/efficiently, in order to achieve the specific purpose of the operations (especially the GET operation, still with the API use case in mind). To reduce personal bias in this matter, I asked two colleagues of mine to judge the queries as shown. We independently arrived at very similar drafts of how the queries should look (see further details below).
Although the authors tried to clarify and reframe the positioning of their work, it is hard for me to understand the actual objective of the revised paper. The authors mention that they intend to “compare observations” to study the “usefulness” of the sequence models, and they state that they do not intend to “measure the cost of each operation”. But in fact the conducted experiments observe and compare the query performance of the individual operations for the different models and list sizes and classify them according to 5 Likert categories. The datasets, which contain only one list each, are synthetic and seem to be designed to study exactly these costs in a way that isolates the different comparison dimensions. Adding to the confusion, the core research questions are defined as: R1: How to model lists *efficiently*? and R2: How do the operations *perform* in current models?

I understand that it is not the objective of the revised paper to evaluate the performance of the databases themselves, and that therefore neither the database configuration nor the queries need to be fine-tuned for each system individually (e.g. use of the list:member SPARQL feature in Jena). I now also understand that the authors do not focus on inspecting optimal ways to model/encode these lists in RDF and on comparing optimizations (e.g. adding a pointer from the RDF list entity to the end of the list to speed up the append operation, or storing the number of elements of a list), although R1 could imply doing exactly that. From my point of view, it definitely is a research gap to accurately measure and compare the query-time costs / query performance for the defined operations (even without testing further optimisations of the models themselves to make them more efficient), and the paper in its current form definitely can (and does) address this.
However, sound/appropriate queries (optimized w.r.t. the nature/benefits of the models and minimal in the sense that they do not perform unnecessary computation) are essential for meaningful “observations”. In my opinion, the current realisation of the queries in Fig. 6 (inconvenient, from my/our perspective) has higher complexity than necessary and thereby weakens the usefulness and trustworthiness of the work/experiment; as a consequence, the subsequent performance results and conclusions can be questioned.
Although I do not share some of the assumptions and arguments made by the authors in response to my comments from the previous review (e.g. that mixing query runs and databases across the models has an advantage over isolating them when observing the performance), I consider the substantial step still missing for acceptance in the SWJ issue to be reconsidering and fixing/optimizing the queries. I do not know the exact number of queries that need to be reworked (I assume at least a portion of the index-based queries), but from the perspective of the paper itself (concept and manuscript) I estimate that this would only be a “minor revision”. However, I acknowledge that repeating (parts of) the benchmarking procedure with improved queries and repeating the manual analysis could be a “major” amount of work, especially if the observations/conclusions vary significantly due to potential complexity-class changes of the queries.
________________________________________________
Figure 6 - “minimal query” issues:
a) The minimal query should look like the SET query's WHERE part. No substring + int-casting + filtering should be needed. (Filtering can be faster than a JOIN in some cases; a proper comparison would need to study both approaches, but I understand that this is not the focus of the authors' research at the current stage, and I agree that the effects are likely to be minor given the low number of triples and sequences in the datasets.)
b) The casting to integer is questionable: could a “minimal” query not use the string representation directly?
c) Sorting is an expensive operator. I see no reason why the entire list should be sorted to retrieve a single element when the index is known and materialized. With that in mind, the casting to integer would also become obsolete.
d) An exact-length property path should be a perfect fit for this kind of query. I see no reason why the Kleene star should be used to materialize the positions of ALL events in the list, and I do not understand the claimed necessity of the subquery.
e) Same as in d): the path length can be determined from the requested index (see the hedged query sketches below).
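To make the expected shape of the GET queries more concrete, here are two hedged drafts. They use placeholder resources only (an example list <http://example.org/list> and example index 500); the actual vocabularies of the paper's models may of course differ. For a Seq-style model, the container membership property for the requested index can be addressed directly, without substring extraction, integer casting, filtering or sorting:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  SELECT ?item WHERE {
    # direct access to (1-based) index 500 via the container membership property
    <http://example.org/list> rdf:_500 ?item .
  }

For the linked-list model, an exact-length property path derived from the requested index avoids materializing the positions of all elements via the Kleene star and the counting subquery (the {n} path quantifier is a draft SPARQL 1.1 feature supported as an extension by several stores, e.g. Jena ARQ; otherwise rdf:rest can simply be chained explicitly):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  SELECT ?item WHERE {
    # 499 rest-hops followed by first yields the element at (1-based) index 500
    <http://example.org/list> rdf:rest{499}/rdf:first ?item .
  }

These are only sketches; the exact formulation would need to follow the models as defined in the paper.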
Fig. 12/13: The fact that 13b) 1k performs much better (by more than an order of magnitude, with small std. deviation) than 500 is still curious to me, even more so now given that this effect cannot be observed for the low-index query.
What is meant by “results are equivalent”?
The changed y-axis scale for 12d) makes it hard to compare 12c) and 13d). I suggest using a consistent scale for the plots of one operation.
p5l31 I don’t understand why the axiom is valid if and *only* if Q is empty
p10l14 “falicitate”
p10l17 “predicable”
p11l44 “as klong”
p14l19 “the item in to”