Review Comment:
In this article, the authors describe a benchmark for federated SPARQL query processing systems, as well as an executor for such a benchmark. The proposed benchmark uses real-world data and queries from the biomedical/pharmaceutical domain. The authors analyze the properties of the datasets and queries, and run the benchmark on three federated systems and a single-source setup.
Unfortunately, the authors did not declare that this submission is an extension of earlier work [a], in violation of SWJ’s guidelines which state that “the submitted manuscript must clearly state the previous publications it is based on.” [b] This, by itself, would be grounds for immediate rejection.
Having reviewed the article, I have mixed opinions. On the one hand, the presented work addresses a clear need in the current federated SPARQL landscape. On the other hand, the article describing it has various flaws, although many of them seem fixable. Hence, I am opting for “major revision” rather than “reject” (notwithstanding my remark above that the submission does not comply with the journal’s policy).
My main question is: why is the proposed benchmark a good benchmark, what does “good” mean, and how do we know? Does the benchmark really test those things that need testing in federated SPARQL engines? I am missing a clear section on requirements engineering, and a subsequent validation of the benchmark against those requirements. While I agree that having real-world data and queries is important, this alone does not make something a good benchmark. If we want to improve on existing benchmarks, we need clear criteria for improvement; otherwise, it is impossible to assess whether the proposed benchmark is indeed a necessity, or just nice to have.
The strengths of the article are:
– addresses a need within SPARQL federation
– real-world data and queries
– large datasets
The weaknesses of the article are:
– the benchmark _itself_ is not evaluated
– no comparison with ANAPSID
– the majority of queries do not execute successfully on the evaluated engines
– missing important discussions, such as a requirements analysis
– there is no link to benchmark data or queries
– not well structured
– several sloppy and/or unfounded statements
I will return to the above points in the detailed comments below.
ABSTRACT
--------
– Why do you introduce a new benchmark? What are the problems with existing benchmarks?
INTRODUCTION
------------
– You explain the tension between generic and informative, but do not draw any conclusion from that. What does this mean for your benchmark?
– “KOBE” is mentioned without introduction.
THE OPEN PHACTS PLATFORM
------------------------
– Reference for Open PHACTS?
– “the complexity of defining efficient Linked Data queries” => does Open PHACTS do anything regarding the “efficiency” of SPARQL queries? (The type FILTER in Listing 1 would contradict this.)
– Open PHACTS offers an HTTP API, but not a REST interface. (REST interfaces would be self-describing and offer hypermedia controls.)
– What is the observable difference between the IMS “as a service” and as “a materialized dataset”? For instance, when accessing http://dbpedia.org/resource/World_Wide_Web, an observer cannot (and should not) conclude whether the underlying data infrastructure is materialized or not; i.e., the Linked Data Web consists of resources, not services. Does this mean that the IMS is not offered as a resource-oriented HTTP interface?
– It might be useful to provide a brief inline definition of “structuredness”; one simplified variant is sketched below. Why doesn’t Conceptwiki have a structuredness value in Table 1?
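For concreteness, here is a minimal sketch of what such a definition could look like; this is only one simplified, coverage-style variant (an assumption on my part), and the authors should state the metric they actually used:

  \[
    \mathit{structuredness}(T, D) =
      \frac{\sum_{i \in I(T,D)} |\mathit{props}(i) \cap P(T,D)|}
           {|I(T,D)| \cdot |P(T,D)|}
  \]

  where I(T,D) denotes the instances of type T in dataset D, P(T,D) the properties used by at least one instance of T, and props(i) the properties set on instance i. A value of 1 then means that every instance sets every property associated with its type.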
DERIVING QUERIES FROM WORKFLOWS
-------------------------------
– Who has transformed the questions into SPARQL queries? If an external party, could you provide a reference?
– Structure-wise, the explanation of Q19 probably fits better in the previous section, so that all domain-specific knowledge is contained in a single section. In any case, the explanation was hard to follow mid-paragraph.
– Where are the queries published? I tried Googling a part of Listing 1, but this brought me to [a] instead of a dataset.
– The fact that no Open PHACTS workflow was defined does not seem a convincing argument for leaving out a question; the corresponding queries could presumably have been created manually as well. As such, the paragraph starting with “Regarding workflow availability” also seems superfluous (and in any case does not provide arguments why the queries that _were_ included are good ones).
– The argument on IMS seems an unnecessary repetition of the previous section.
QUERY CHARACTERISTICS
---------------------
– “The typical flow of a federated query processing system consists of three phases” => reference needed
– Table 3 does not show the syntactic, but rather the structural point of view.
– How are the triples originally organized into the different named graphs?
– The term “graph annotations” is ill-defined and therefore confusing; please be specific when you mean the named graph in which triples are organized. The same holds for “graph annotations” in SPARQL queries later; those should be referred to as GRAPH clauses/keywords.
– typo: “on the other hard”
– The observation about the number of predicates is directly relevant to my remarks about an evaluation of the benchmark: to what extent does this property make your benchmark a good one?
– How exactly was “potential contribution to the resultset” measured? What is the relation to selectivity (one common definition is sketched after this list)?
– Regarding the listed SPARQL features per query: this highly depends on how a query was generated. For instance, Q19 as shown in Listing 1 makes a crucial choice by placing a FILTER on the object of an rdf:type statement. Alternatively, this might have been expressed as a UNION of two rdf:type statements, without a FILTER (see the sketch after this list). In the latter case, federated engines might use class-based filters much more effectively. So then the question is what this query is measuring: an engine’s intrinsic source selection capabilities, or rather the optimization capabilities of a certain SPARQL implementation? And this is just a single issue in the one query that was given. I could not find the other queries online, but I imagine there could be more issues like this one.
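Regarding the selectivity question above: for concreteness, one common definition of triple-pattern selectivity (an assumption on my part; the authors may intend a different notion) is

  \[
    \mathit{sel}(tp, D) = \frac{|\{\, t \in D : t \text{ matches } tp \,\}|}{|D|}
  \]

  i.e., the fraction of triples in dataset D that match triple pattern tp. Relating “potential contribution to the resultset” explicitly to such a notion would make the measurement reproducible.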
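To illustrate the FILTER/UNION point above, here is a minimal sketch of the two formulations; the ex: class IRIs are hypothetical placeholders, not the actual Open PHACTS vocabulary:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  # Variant A: a FILTER on the object of an rdf:type pattern,
  # in the style of Listing 1
  SELECT ?s WHERE {
    ?s rdf:type ?class .
    FILTER (?class = ex:Target || ?class = ex:Gene)
  }

  # Variant B: the same intent as a UNION of two concrete
  # rdf:type patterns, without a FILTER
  SELECT ?s WHERE {
    { ?s rdf:type ex:Target }
    UNION
    { ?s rdf:type ex:Gene }
  }

Variant B exposes the concrete classes to the source selection phase, so an engine can rule out sources that contain neither class; Variant A forces the engine to consider every source containing rdf:type triples unless it implements FILTER-aware pruning.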
EXPERIMENTS
-----------
– Why were those 3 engines selected? In particular, why was ANAPSID not selected, which is known to perform well on certain queries that lead to large intermediate results in other engines [c]?
– On a similar note, why were Triple Pattern Fragments not considered, which have been shown to perform well in federated scenarios [c]? A short answer here could be that current TPF engines do not support all SPARQL features; however, the text mentions that other federated engines also have issues regarding feature support. (Disclaimer: I am an author of TPF work; I am not necessarily requesting its inclusion, but I think the question is relevant.)
– FedX “typically being faster” needs a reference; there is also evidence to the contrary [c].
– Results also depend on network latency, which was not mentioned.
– Regarding performance on the FedBench benchmark, it should be noted that this benchmark has been criticized for its bias toward exclusive groups, and more complex queries have been suggested that put FedX at a disadvantage [d].
– “stand-alone Virtuoso database” should probably be “a single Virtuoso instance containing all datasets”
– Why 4 computation nodes? There are more datasets.
– Which network setup? What is the influence of the network?
– Why would a “very large” query be “probably difficult”? This is not necessarily the case. Furthermore, this also brings me back to my main concern: what are we testing then? If the results strongly depend on the syntax used to express a query, we might be evaluating a certain implementation’s capability to perform very specific optimizations, as opposed to the general strengths of federation engines. As we all know, many federation engines are currently research efforts, and research groups can typically not afford to focus on specific optimizations. Improvements on the benchmark would thus not necessarily mean fundamental improvements of the engine, which seems problematic.
– Structure-wise, the recommendations at the end of each subsection probably belong in the conclusion.
– The syntax errors that result from SPLENDID make the results rather uninteresting; wouldn’t it be more interesting to simply fix SPLENDID and continue from there?
– “exponential” needs a reference
– It would be useful to indicate the type of errors in Tables 5 and 6 for the sake of overview; as it stands, the assignment of these types is hard to follow in the text.
– You verify the number of results, but do you verify the correctness of individual results?
– In relation to the Virtuoso errors, what was the exact Virtuoso configuration used?
– It seems highly problematic that the majority of engines fail on the proposed queries. It is definitely interesting to have a couple of queries fail for clear reasons, as this sets a goal for a new generation of engines. However, given the high failure rates, it is unclear whether the proposed benchmark is a realistic next target, or rather a goal in some distant future. In other words, it is unclear what insights this benchmark delivers, and what we should conclude from it.
RELATED BENCHMARKS
------------------
– This section comes too late, given that many of its concepts are already used earlier.
– This section should also discuss federation engines, and justify the selection of engines.
– The FedBench complex queries [d] need to be mentioned.
– More generally: what exactly are the strong and weak points of existing benchmarks? This would provide the motivation for a new benchmark.
CONCLUSION
----------
– “To understand what new insights can be gained” => So, what are the new insights that can be gained?
– What are the conclusions drawn that have not been previously observed? They definitely belong in the “Conclusion[s]” section.
– Why would we need a “battery of benchmarks”?
– I appreciate the point on query loads; this domain should definitely broaden its focus from execution time only.
[a] http://ceur-ws.org/Vol-1700/paper-04.pdf
[b] http://www.semantic-web-journal.net/faq#q9
[c] http://linkeddatafragments.org/publications/jws2016.pdf
[d] http://ceur-ws.org/Vol-905/MontoyaEtAl_COLD2012.pdf