Review Comment:
This paper presents the nuts and bolts of LSQ 2.0, which is the new version of the already very useful LSQ 1 dataset, containing many SPARQL queries from a diverse set of sources.
In my opinion, LSQ is a beautiful initiative that is very useful to the research community. I understand that the amount of work that goes into
- designing the dataset;
- making it uniform and easily accessible; and
- getting all the query logs together
is not to be underestimated and I highly appreciate the authors' commitment.
A particular point that I believe is very interesting for researchers is (F3), the runtime query statistics. Some freely available query logs do offer timestamps or information on whether a query timed out (e.g. the Dresden Wikidata logs), but the authors take the effort of re-running all queries locally and producing valuable information on query execution. In some cases, this may be even more informative than the original information from the SPARQL endpoints, because it is independent of the current query load of a specific server.
I recommend that this paper be accepted; I only have a list of minor comments.
Comments
========
p2 left 18: "...as these logs can reveal patterns in how users incrementally build their queries." Notice that [A, Section 10] indeed studies the evolution of queries over time. This seems to be an extended version of [19]. The study was only done on DBpedia log files though.
p2 right 20: LSQ has also contributed to the motivation of [B, C, D, E] (some indirectly through [A] and [21]). I don't mean that the authors should mention all this work, but query log analysis was crucial to the motivation of some of these papers, and the authors may want to take a look to see which examples are worth mentioning.
p3 left 13: I want to add that the work [B] is very explicit about using query logs to motivate the study of a new aspect of queries that has been largely overlooked in the literature, in this case threshold values in queries. Coincidentally, the paper even uses the phrase "in the wild" in its title.
p3 right 21: distill
p4 right 32: ...the query that may be of interested to our identified...
p4 right 45: What is a salted IP? (Some random data that one adds before the hash?)
p4 footnote 4: ... for every dataset D. ("Any" is ambiguous. Can be read as existential or universal.)
p6 left 34: You can add that your approach of adding runtime statistics on the queries may give a more accurate estimate of the query's difficulty because it is not dependent on the server's workload when the query was executed. Your way of adding runtime statistics is very valuable, in my opinion.
p7 left 20: Vocabuaries -> Vocabularies
p7 right 15: Too much whitespace before footnote 11.
p7 footnote 8: Could you explain what the 5th star is? It makes the paper more self-contained.
p8 right 47: I have the feeling that the 45 million from the abstract, the 44.0 million here, and the 43.9 million on p2 of the paper all refer to the same number. Please make them uniform. (Or at least, please use the same number with the same precision throughout. If it's really 44.0 million, I'd say "44 million" in the abstract and use "44.0 million" throughout the paper.) Alternatively, if these numbers are supposed to be different, please add an explanation why.
The same remark holds for the 11.6 million here and the 11.55 million in the abstract.
p9 left 14: Is this the first occurrence of BGP? If so, please also mention the term "Basic Graph Pattern".
p12 right, Table 4: Please add the decimal separator in the "Prop. Path Features" columns to improve readability (and uniformity with the columns further left).
p16 left 41: Paper [D] also squarely fits in this category.
[A] Angela Bonifati, Wim Martens, Thomas Timm: An analytical study of large SPARQL query logs. VLDB J. 29(2-3): 655-679 (2020)
[B] Angela Bonifati et al.: Threshold Queries in Theory and in the Wild. CoRR abs/2106.15703 (2021)
[C] Wenfei Fan, Ping Lu: Dependencies for Graphs. ACM Trans. Database Syst. 44(2): 5:1-5:40 (2019)
[D] Diego Figueira et al.: Containment of Simple Regular Path Queries. KR 2020: 371-380
[E] Anil Pacaci, Angela Bonifati, M. Tamer Özsu: Regular Path Query Evaluation on Streaming Graphs. SIGMOD Conference 2020: 1415-1430