SPARQLES: Monitoring Public SPARQL Endpoints

Tracking #: 1262-2474

Authors: 
Pierre-Yves Vandenbussche
Juergen Umbrich
Luca Matteis
Aidan Hogan
Carlos Buil-Aranda

Responsible editor: 
Jens Lehmann

Submission type: 
Tool/System Report
Abstract: 
We describe SPARQLES: an online system that monitors the health of public SPARQL endpoints on the Web by probing them with custom-designed queries at regular intervals. We present the architecture of SPARQLES and the variety of analytics that it runs over public SPARQL endpoints, categorised by availability, discoverability, performance and interoperability. We also detail the interfaces that the system provides for human and software agents to learn more about the recent history and current state of an individual SPARQL endpoint or about overall trends concerning the maturity of all endpoints monitored by the system. We likewise present some details of the performance and usage of the system thus far.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 20/Jan/2016
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This is a re-submission of a major revision that describes the SPARQLES system for monitoring the maturity of public SPARQL endpoints. I have read through the new manuscript as well as the attached response to reviews. I am happy with most of the responses, and it is particularly good to see the system live again and to be able to test the described functions.

It is particularly good to see the more detailed descriptions of the system's impact and of the system itself. However, I still feel the submission could benefit from some very minor clarifications.

In my original review I raised the question about the origin of the four chosen criteria. This has been clarified much better in the resubmission. However, I feel that the response to reviewers gives an even more honest answer to this question, and readers could benefit from that information, particularly the following paragraphs:

- That said, while the resulting dimensions were not the result of an empirical study or a user survey, we feel that these four dimensions provide a comprehensive overview of the maturity of a given public SPARQL endpoint.
- While we consider these dimensions comprehensive and useful, we do not claim that they – nor the tests we perform to partially quantify them – are complete. We do believe, however, that they are useful and important for the community to be aware of. We also remain open to suggestions from the community with respect to new types of aspects or analytics to perform.
They do not need to be included as they are, but it is good for readers to be aware of the open-ended nature of these dimensions, which may inspire future research on this topic.

Secondly, in the sustainability section, the authors could add some more discussion of the sustainability of the service/system as well as of the code base. The code base is clearly openly accessible, but what about the SPARQLES system itself? What happens if funding to host the system runs out? Is there anything to prevent anyone from hosting a mirror of the system? Is there a robust mirroring mechanism to sustain its availability?

Finally, as a reviewer who has read through the paper both as a user and as a developer, I think there is still a lot of room for more usability and functionality studies. The analytics data presented in Section 7.1 could be biased, because the "availability" pane is the first one presented on the front page, which may naturally lead to more traffic. The colored icons are helpful, but there is no way to order the endpoints by any column. A way to combine these criteria to search for or prioritise alternative endpoints could also be very useful. The system could provide a fruitful playground for rigorous HCI studies towards a tool that is truly useful to the community. But do the authors have the capacity for this in their future work plan? The authors mention that feedback is managed in the open issue tracker, but without any knowledge about future funding, it would be good if they could expand a bit more on their future plans for making the tool more usable. The authors could also consider including some of the relevant user feedback as an appendix to the manuscript, in order to make the argument more complete.

Review #2
By Ivan Ermilov submitted on 03/Feb/2016
Suggestion:
Accept
Review Comment:

The authors have addressed the issues raised in the previous review of the paper in full.

Minor issues, which should be fixed in the final version:

* "In Section 2, we first discuss..." on page 3. It would be better to use cleveref latex package (see http://www.howtotex.com/packages/automatic-clever-references-with-cleveref/) and have a proper capitalization in this paragraph.
* "In Figure 3, for example..." on page 10. Again capitalization.
* "The interface is implemented using various Javascript libraries, including Node.js ..." on page 10. Node.js is not a Javascript library. It is JavaScript runtime. This sentence has to be fixed.

Review #3
Anonymous submitted on 18/Mar/2016
Suggestion:
Minor Revision
Review Comment:

In general, the revised version of the paper is improved, and it now fits the tools and systems category. Nevertheless, I have the following concerns, which should be addressed in a second revision:

I think that reporting the (precise) results of previous work (reference 9) in the introduction is unnecessary, and these results should be removed. Why are they important for motivating the system?

One of the other reviewers raised the question of why you query the SPARQL endpoint instead of issuing a simple ping. I would suggest adding an argument against a ping/HEAD request to the availability section (like the one that can be found in the response to the reviewers). This should clarify why availability is tested by querying the endpoint.
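
To make the distinction concrete, a minimal sketch of a query-based availability probe, assuming the standard SPARQL 1.1 Protocol over HTTP (the endpoint URL is only an example, and this is not SPARQLES' actual implementation):

    import requests

    def probe_endpoint(endpoint_url, timeout=30):
        """Probe availability by issuing a trivial SPARQL ASK query.

        Unlike a ping or HEAD request, this only succeeds if the endpoint
        actually parses and answers SPARQL, not merely if the host is up.
        """
        try:
            response = requests.get(
                endpoint_url,
                params={"query": "ASK WHERE { ?s ?p ?o }"},
                headers={"Accept": "application/sparql-results+json"},
                timeout=timeout,
            )
            return response.status_code == 200 and "boolean" in response.json()
        except (requests.RequestException, ValueError):
            return False

    print(probe_endpoint("http://dbpedia.org/sparql"))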

Moreover, text in the screenshot of the homepage (Figure 3 on page 11) is cut off. Therefore I would suggest adding a (larger) screenshot of the homepage without cutting off the text.

The authors argue for the sustainability of the system with respect to storage by looking at the growth of the compressed database dump. This is not entirely sound, since the compression ratio is also an important factor: the database can grow by hundreds of gigabytes while the compressed dump stays the same size. Furthermore, the authors should point out why it matters that the database fits on a 1 TB disk (current SSDs are roughly 1 TB, which can handle the high I/O load of a database, and capacity can be extended using RAID 10; so why is there a 1 TB limit?). In my opinion, storage requirements are not a problem at all, because MongoDB scales horizontally using sharding.
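
For reference, a minimal sketch of how such horizontal scaling can be enabled in MongoDB, assuming an already-deployed sharded cluster; the database and collection names here are hypothetical, not those used by SPARQLES:

    from pymongo import MongoClient

    # Connect to the mongos router of a sharded cluster (address is illustrative).
    client = MongoClient("mongodb://localhost:27017")

    # Enable sharding for a hypothetical 'sparqles' database, then shard the
    # availability-results collection on a hashed endpoint key so that writes
    # are spread evenly across shards.
    client.admin.command("enableSharding", "sparqles")
    client.admin.command("shardCollection", "sparqles.availability",
                         key={"endpoint": "hashed"})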

The line "the most common user interaction starts on the homepage […] this can be seen from first […] data row of Table 1" should be rephrased. It cannot be seen from the first row, it is the highly likely interaction derived from the statistics.

In Section 7.1 the impact of the tool is discussed by listing works that use it. To highlight how the tool is applied by the research community, it is essential to mention which parts of the system are employed in the respective works.
From my point of view, the citation-count argument for reference 9 should be removed.

Minor remarks:

- p. 4: A list should have at least two items
- p. 5: "Schedule" should be "Frequency" to match the structure of the other subsections
- p. 10: Figure 4 & 5 => Figure 4 and 5
- p. 11: Figure 7 is mentioned here but is actually on page 15; can all tables and figures be rearranged? (to place them near their mentions; use subfigures?)
- Figure 6 (p. 14) and 8 (p. 15): only plot the line, no need to fill the area
- p. 15: Figure 7: axis labels are missing; please reduce the number of colors in the box plots

Off-topic:

Have you considered registering the domain sparqles.org? I think the current domain is too long.