LSQ 2.0: A Linked Dataset of SPARQL Query Logs

Tracking #: 2866-4080

Authors: 
Claus Stadler
Muhammad Saleem
Qaiser Mehmood
Carlos Buil-Aranda
Michel Dumontier
Aidan Hogan
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Dataset Description
Abstract: 
We present the Linked SPARQL Queries (LSQ) dataset, which currently describes 45 million executions of 11.55 million unique SPARQL queries extracted from the logs of 27 different endpoints. The LSQ dataset provides RDF descriptions of each such query, which are indexed in a public LSQ endpoint, allowing interested parties to find queries with the characteristics they require. We begin by describing the use cases envisaged for the LSQ dataset, which include applications for research on common features of queries, for building custom benchmarks, and for designing user interfaces. We then discuss how LSQ has been used in practice since the release of four initial SPARQL logs in 2015. We discuss the model and vocabulary that we use to represent these queries in RDF. We then provide a brief overview of the 27 endpoints from which we extracted queries in terms of the domain to which they pertain and the data they contain. We provide statistics on the queries included from each log, including the number of query executions, unique queries, as well as distributions of queries for a variety of selected characteristics. We finally discuss how the LSQ dataset is hosted and how it can be accessed and leveraged by interested parties for their use cases.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 06/Sep/2021
Suggestion:
Reject
Review Comment:

Collecting query logs and providing them to the public for research is certainly a great idea. However, apart from the additional datasets, the contribution of the authors with respect to their previous work [69] is unclear. There seem to be only minor changes, e.g. in the data model / RDF representation and the statistics. A very important contribution would be to provide natural language interpretations of these queries.

Detailed comments:

- The natural language question would also be very helpful to make the dataset more interesting.
- There should be more descriptive, textual metadata to make it easy to look for queries and datasets. For example, how do I search for queries and datasets about actors or airplanes?
- The authors write “query can thus have multiple executions within a single endpoint, but the same query string issued to n endpoints will be considered n different queries.” If query q1 is executed n times, it should still be considered as one query but with n execution paths.
- How were the query logs of all 27 datasets collected? Did the users manually type the SPARQL queries or did they use predefined queries or templates?
- What was the actual contribution of the authors for collecting and processing the queries?
- Table 5 should also contain min and max runtimes.
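
The query-identity point raised in the third comment above can be made concrete with a minimal sketch. The log records, endpoint names, and the `count_queries` helper below are hypothetical and are not taken from LSQ; the sketch only contrasts the two counting conventions (one query per endpoint and query string, versus one query per query string).

```python
from collections import defaultdict

# Hypothetical log records as (endpoint, query string) pairs; not LSQ data.
log = [
    ("endpoint-a", "SELECT * WHERE { ?s ?p ?o }"),
    ("endpoint-a", "SELECT * WHERE { ?s ?p ?o }"),
    ("endpoint-b", "SELECT * WHERE { ?s ?p ?o }"),
]

def count_queries(records, per_endpoint):
    """Group executions by query identity; return {identity: execution count}."""
    executions = defaultdict(int)
    for endpoint, text in records:
        # Convention quoted from the paper: identity is (endpoint, query string).
        # Convention suggested in the review: identity is the query string alone.
        key = (endpoint, text) if per_endpoint else text
        executions[key] += 1
    return dict(executions)

per_endpoint = count_queries(log, per_endpoint=True)
global_identity = count_queries(log, per_endpoint=False)
print(len(per_endpoint), len(global_identity))
```

Under the first convention the same query string yields two distinct queries; under the second it yields one query with three executions.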

Review #2
By Martin Necasky submitted on 09/Sep/2021
Suggestion:
Minor Revision
Review Comment:

The manuscript was submitted as 'Data Description'. It describes a linked dataset of SPARQL query logs (LSQ), version 2.0. LSQ 2.0 features logs from 27 SPARQL endpoints and describes 43.9 million SPARQL query executions on these endpoints. The paper is an extension of the paper "LSQ: The Linked SPARQL Queries Dataset" published at ISWC 2015. The submitted extended version has the same structure as the previous paper. Both papers consider the same set of use cases for the LSQ dataset and both introduce the same data model for representing query logs. The reviewed paper describes significant extensions compared to the first version from 2015. These include more datasets, queries and query executions, and more detailed statistics. The broad usage of the dataset is also presented in a structured form, which transparently shows various recent research works that used LSQ for evaluation. The paper shows only the usage of LSQ 1.0. However, this clearly proves the usefulness of the dataset. The new version, as a significant extension of the first version, will undoubtedly find broad usage as a pool of queries for various benchmarks.

In summary, even though the paper does not bring anything groundbreaking compared to the paper from 2015, it should be published in my opinion. It is necessary to inform the research community about the new version of the dataset. The paper contains all the information readers need to start using the dataset.

I have two minor issues related to the information provided about the dataset in the paper.

I1) Metrics and statistics on external and internal connectivity of the dataset: The SPARQL endpoints are described in an LSQ-specific way. However, existing descriptions of SPARQL endpoints expressed as VOID resources could be linked instead.
I2) Use of established vocabularies: This is related to the previous issue. On page 4, Query instance, the authors describe that the originating endpoint of a query is associated with the dataset using an LSQ-specific property lsqv:endpoint. The stated reason is that void:endpoint or sd:endpoint could not be used because they have different domains. This is a strange reason. For example, an LSQ query can be associated with a void:Dataset using an LSQ-specific property. Then it would be possible to use void:endpoint (https://www.w3.org/TR/void/#sparql). I understand that it is not possible now to change the vocabulary. However, the rationale behind this design decision should be justified better.

The paper demonstrates the necessary level of maturity of the dataset, including its quality, stability, and usefulness, as shown by its use by many researchers as a benchmark for evaluating their results. The description of the dataset is at the necessary level of detail. I have two minor issues related to the clarity of the description.

I3) Page 4, query instance: The property lsqv:endpoint is described in the text with lsqv:Query as a domain. However, Fig. 1 shows that the domain of lsqv:endpoint is lsqv:RemoteExec. The textual description of the LSQV vocabulary is inconsistent with the figure.

I4) Page 4, static features: It is described that static features are defined for a query, independently of the dataset over which the query is evaluated. However, the paragraph above describes that queries are considered only in relation to an endpoint. In other words, an instance of lsqv:Query is always a pair of SPARQL expression and endpoint. LSQV does not allow describing features of queries independently of the endpoints. Therefore, the paragraphs "Query instance" and "Static features" on page 4 contradict each other.

Regarding the LSQ data file

(A) It is well organized and available for download as well as via a SPARQL endpoint through an access page http://lsq.aksw.org/. The access page contains the necessary information about access to the dataset.

(B) The provided resources are complete for replication of experiments. The access page also provides tools for the experiments and their documentation.

(C) LSQ is not available through a public repository such as GitHub, Figshare or Zenodo. The authors use their own web server which seems appropriate for long-term discoverability. LSQ does not have a DOI.

(D) The provided data artifacts are complete. The LSQ dump is structured into files for individual SPARQL endpoints. The content for each endpoint seems complete.

Review #3
Anonymous submitted on 04/Oct/2021
Suggestion:
Accept
Review Comment:

This paper presents the nuts and bolts of LSQ 2.0, which is the new version of the already very useful LSQ 1 dataset, containing many SPARQL queries from a diverse set of sources.
In my opinion, LSQ is a beautiful initiative that is very useful to the research community. I understand that the amount of work that goes into
- designing the dataset;
- making it uniform and easily accessible; and
- getting all the query logs together
is not to be underestimated and I highly appreciate the authors' commitment.

A particular point that I believe is very interesting for researchers is (F3), the runtime query statistics. Some freely available query logs do offer timestamps or information on whether a query timed out or not (e.g. the Dresden Wikidata logs), but the authors take the effort of re-running all queries locally and producing valuable information on query execution. In some cases, this may be even more informative than the original information from the SPARQL endpoints, because it is independent of the current query load of a specific server.

I recommend this paper to be accepted and only have a list of minor comments.

Comments
========

p2 left 18: "...as these logs can reveal patterns in how users incrementally build their queries." Notice that [A, Section 10] indeed studies the evolution of queries over time. This seems to be an extended version of [19]. The study was only done on DBpedia log files though.

p2 right 20: LSQ has also contributed to the motivation of [B, C, D, E] (some indirectly through [A] and [21]). I don't mean that the authors should mention all this work, but query log analysis was crucial to the motivation of some of these papers, and the authors may want to take a look to see which examples are worth mentioning.

p3 left 13: I want to add that the work [B] is very explicit about using query logs to motivate the study of a new aspect of queries that has been largely overlooked in the literature. In this case, threshold values in queries. Coincidentally, the paper even uses the phrase "in the wild" in its title.

p3 right 21: distill

p4 right 32: ...the query that may be of interested to our identified...

p4 right 45: What is a salted IP? (Some random data that one adds before the hash?)
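
For readers unfamiliar with the term in the question above, the parenthetical guess matches the usual meaning of salting: secret random bytes are combined with the value before hashing, so the digest cannot be reversed by hashing every possible IP address. The following is a minimal sketch under stated assumptions; the function name `salted_ip_hash`, the use of SHA-256, and the 16-byte salt are illustrative choices and are not confirmed to be the scheme used for LSQ.

```python
import hashlib
import secrets

def salted_ip_hash(ip: str, salt: bytes) -> str:
    """Return a hex digest of salt + IP (illustrative; not LSQ's confirmed scheme)."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()

# A secret random salt; 16 bytes is an assumption made for this sketch.
salt = secrets.token_bytes(16)

# The same IP and salt always give the same pseudonym, so requests from
# one client remain linkable without exposing the address itself.
a = salted_ip_hash("192.0.2.1", salt)
b = salted_ip_hash("192.0.2.1", salt)

# A different salt gives an unrelated digest for the same IP.
c = salted_ip_hash("192.0.2.1", secrets.token_bytes(16))
```

Reusing one salt per log keeps executions by the same client grouped together, while keeping the salt secret prevents recovering addresses by brute force over the IPv4 space.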

p4 footnote 4: ... for every dataset D. ("Any" is ambiguous. Can be read as existential or universal.)

p6 left 34: You can add that your approach of adding runtime statistics on the queries may give a more accurate estimate of the query's difficulty because it is not dependent on the server's workload when the query was executed. Your way of adding runtime statistics is very valuable, in my opinion.

p7 left 20: Vocabuaries -> Vocabularies

p7 right 15: Too much whitespace before footnote 11.

p7 footnote 8: Could you explain what the 5th star is? It makes the paper more self-contained.

p8 right 47: I have the feeling that the 45 million from the abstract, the 44.0 million here, and the 43.9 million on p2 of the paper are all referring to the same number. Please make them uniform. (Or at least, please uniformly use the same number in the same precision. If it's really 44.0 million, I'd say "44 million" in the abstract and use "44.0 million" throughout the paper.) Alternatively, i.e., if these numbers are supposed to be different, then please add an explanation why.
The same remark holds for the 11.6 million here and the 11.55 million in the abstract.

p9 left 14: Is this the 1st occurrence of BGP? If so, please also mention the term "Basic Graph Pattern".

p12 right, Table 4: Please add the decimal separator in the "Prop. Path Features" columns to improve readability (and uniformity with the columns further left).

p16 left 41: Paper [D] also squarely fits in this category.

[A] Angela Bonifati, Wim Martens, Thomas Timm: An analytical study of large SPARQL query logs. VLDB J. 29(2-3): 655-679 (2020)

[B] Angela Bonifati et al.: Threshold Queries in Theory and in the Wild. CoRR abs/2106.15703 (2021)

[C] Wenfei Fan, Ping Lu: Dependencies for Graphs. ACM Trans. Database Syst. 44(2): 5:1-5:40 (2019)

[D] Diego Figueira et al.: Containment of Simple Regular Path Queries. KR 2020: 371-380.

[E] Anil Pacaci, Angela Bonifati, M. Tamer Özsu: Regular Path Query Evaluation on Streaming Graphs. SIGMOD Conference 2020: 1415-1430