Reproducible Query Performance Assessment of Scalable RDF Storage Solutions

Tracking #: 1784-2997

Dieter De Witte
Dieter De Paepe
Laurens De Vocht
Jan Fostier
Ruben Verborgh
Erik Mannens
Hans Constandt
Kenny Knecht
Filip Pattyn

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper

Abstract: 
Applications in the biomedical domain rely on Linked Data spanning multiple datasets for an increasing number of use cases. Choosing a strategy for running federated queries over Big Linked Data is however a challenging task. Given the abundance of Linked Data storage solutions and benchmarks, it is not straightforward to make an informed choice between platforms. This can be addressed by releasing an updated review of the state-of-the-art periodically and by providing tools and methods to make these more (easily) reproducible. Running a custom benchmark tailored to a specific use case becomes more feasible by simplifying deployment, configuration, and post-processing. In this work we present in-depth results of an extensive query performance benchmark we conducted. The focus lies on comparing scalable RDF systems and to iterate over different hardware options and engine configurations. Contrary to most benchmarking efforts, comparisons are made across different approaches to Linked Data querying, conclusions can be drawn by comparing the actual benchmark costs. Both artificial tests and a real case with queries from a biomedical search application are analyzed. In analyzing the performance results, we discovered that single-node triple stores benefit greatly from vertical scaling and proper configuration. Results show that horizontal scalability is still a real challenge to most systems. Semantic Web storage solutions based on federation, compression, or Linked Data Fragments still lag by an order of magnitude in terms of performance. Furthermore we demonstrate the need for careful analysis of contextual factors influencing query runtimes: server load, availability, caching effects, and query completeness all perturb the benchmark results. With this work we offer a reusable methodology to facilitate comparison between existing and future query performance benchmarks. We release our results in a rich event format ensuring reproducibility while also leaving room for serendipity. This methodology facilitates the integration with future benchmark results.

Decision/Status: 
Reject (Two Strikes)

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 15/Jan/2018
Major Revision
Review Comment:

The paper has improved considerably and, in general, the authors have done a good job addressing previous concerns (particularly with respect to the black-box comment, with more detail provided at least on what sorts of features in queries lead to problems; also with respect to query correctness and caching). On these points the authors should be commended. In general, I remain positive about this paper and commend the obviously significant amount of labour that has been invested to generate the given results and achieve the described insights.

However, my main concern for the paper is first and foremost the writing (and presentation). Though the structure and writing have improved considerably, there is still considerable room for improvement: the paper is still a tough read. I believe this is primarily due to four factors: (1) what I would call "the curse of knowledge" (using lingo familiar to the writer in a way that can be confusing to the reader); (2) poor flow of ideas at times; (3) minor errors in English; and (4) a lack of attention to detail. The problems are quite subtle but frequent enough to make me have to stop and re-read various parts (sometimes several times). Just due to these issues in writing, I am going to recommend a Major Revision: though the paper has improved considerably, I really feel that the writing needs to improve dramatically before acceptance.

I will try to give detailed comments, but these are really only examples of the types of problems I encountered; my suggestion is a more general revision (line-by-line) of the text. Given that there are nine authors on the paper, I am confident that for the next version, the writing can be improved considerably.

## Detailed comments ##

I will mark more important comments with @

* "and to iterate" -> "and iterating"

* "querying[;] conclusions can be ... actual benchmark costs." What is a cost here? I believe this is economic cost?

* "facilitates -the- integration"

* "choice, +and+ which are optional?" This is also weirdly expressed. I presume the intent is to refer to the features that are essential to the application/use-case, not the query system "of choice".

* In Challenges, how is (iii) different from (i)? In general (iii) seems to obviate the need for the other questions.

* In Challenges, point (iv) is also weird.

* "we fill" -> "we will"

@ "Vendor" vs. "SemWeb". This lingo is strange and confusing. FluidOps is a vendor. Virtuoso et al. are SemWeb systems. Also I do not buy into this "research prototype" distinction; the difference between TPF and Virtuoso is more than that: they are completely different architectures. The continued reference to "compression" also makes zero sense to me: as I (and another reviewer) mentioned previously, all systems implement compression. The real difference between these two categories seems to be that one refers to end-to-end SPARQL databases, the other to "wrappers" for federation (perhaps this does not best characterise the difference either but the current "categories" must be clarified and named more appropriately).

* "in section 4.1[;]" (Frequent error: have a look at Prefer semi-colons when the thing after a comma is an independent clause.)

* "mapping workloads [to]"

* "we -both- study scalability +not only+ in terms ..."

* "distrib+ut+ed" (Run a spell-check!)

* "Federated systems consist-s-" (rather: are configured with?)

* "with \textit{N} the number of slaves"

@ "When the benchmark successfully terminates ... derived views are online." Again this sort of writing is too verbose and redundant. Essentially it tells me that a CSV file is used, but then it turns out that the CSV file doesn't contain enough info, so a CSV file is not used, instead a log records what you'd expect a log to record, which is richer than the CSV file. Honestly, I'm not interested: I trust that the authors can log results from a query run. My suggestion is to leave Table 6 and say that that's the information that's logged and point to the homepage for more info. (I am using this more as an example of a way that the writing can improve: there are other parts of the text that likewise could be "tightened up" considerably).

* Footnote 25: I do not follow: how does vertical partitioning better demonstrate the abilities of a federated querying system? What does vertical partitioning even mean for an RDF graph? Put the subjects on one machine, the predicates on a second, and the objects on a third? (I don’t understand.)

* "decided to release the query-set (project website)" A link to the website would be better.

* "Query features: …" This paragraph is neither text nor data. Please revise.

@ "While WatDiv queries are all BGP queries, these 1,223 queries are rich in SPARQL features ..." Again I take this as an example of poor flow of writing: the reader is naturally going to think that the 1,223 queries are the WatDiv queries and be puzzled by the inherent contradiction. Simply replacing "these 1,223 queries" -> "the 1,223 DISQOVER queries" would save the reader *considerable* confusion.

* Footnote 26: The reference does not mention "non-conjunctive queries" anywhere. I still have problems with this term, which is synonymous for me with "non-join queries": the queries still contain joins/conjunctions. Why not say more directly what you would like to say? "A large fraction of the queries also use features such as union, optional, etc., making them even more challenging"?

* "unbound triples" What does this mean? I guess it means all three terms are variables?

* "performance differences between systems are most often reported using the median runtime. [...] In the following box plots we chose to report both." This is really confusing.

* "Response times" Worth defining explicitly.

* "interprete"

* "are shown [for]"

* "10M, 100M, and 1000M (million) triples." [We begin by running tests with 32GB of memory.]

* "However, in the 32GB setting only Vir1_32 is performant, with a median runtime of 18.6." There is only a 32GB setting so this is confusing; I think maybe it should read "in the 1000M setting". Also I think you are referring to the mean, not the median value.

* "Bla1_32 competes with Gra1_32 for batch workloads" It took me a while to realise that "batch workloads" refers to mean runtimes. Please just write "mean runtimes" to reduce complexity (in all such cases).

* "Moving from the left panel to the right ... we clearly see the results converging." The reader is naturally going to look for results converging from the left to the right when what is meant is that the results converge at 64GB for the different systems!

* Figure 3: Why is there no Flu1? (Also why is Flu* called Flu* and not the more recognisable FedX*?)

@ Figure 5 is still a nightmare. Given five series, choosing pink, orange and red is not optimal in terms of readability, particularly given so many (thin) bars. Also Figure 5 is not referenced from the text. Please improve the readability of the figure (breaking it up into two rows if needed), improve the colours, and reference the figure from the text.

* "Since these SPARQL-on-Hadoop ..." I was confused by this discussion since real-time SPARQL on Hadoop is a terrible idea; however, S2RDF is using Spark, which has distributed main memory (and a lot less latency). I would recommend to be more specific here.

* "Often not reported" By whom?

* "This pattern is very stable" What pattern?

* "small preference [for]"

* Figure 7: units on the axes. Caption: "for ES1_64" -> "while ES1_64"

* "the runtime of the +single-threaded+ warm-up run" (Just a hint for the reader.)

@ "As we will see in the next section however, caching only plays a role for the slow-running queries (C-templates) in the case of Blazegraph." This is clearly not true! What I realise afterwards is that this is just poor flow and the intent is to say: "However, as we will see in the next section, between these two systems, caching only plays a role for the slow-running queries (C-templates) in the case of Blazegraph."

* "the speedup with respect to the slowest query execution" How precisely is this speedup computed? Is it guaranteed that the speedup must be with respect to a previous query (and thus really due to caching)?

* Figure 8: x-axis, units.

* Figure 8 caption: "comparing all query runtimes in the multi-threaded run with the slowest execution in the stress test". This is confusing. What is the "stress test" (I thought it was the multi-threaded run?) What do you mean "comparing all query runtimes"? Do you plot points for each repeated execution?

* "As was explained in section 3.2, [...] are related." I don't understand what point is being made here.

* "[bears] a lot of similarity"

* "in+to+ the behaviour"

* "do currently not" -> "do not currently"

* "Only Virtuoso simulations had a sufficiently wide benchmark [...]. In Figure 9 [...]" You are discussing results before you introduce the data! This leads to confusion: where did you get the first observation from?

* "Vir3_64_Opt_0"/"Vir3_64_Opt_2" Maybe I missed it, but I still have no idea what the "_0" and "_2" refer to.

@ "Incorrect Queries": The queries are fine! It's the results that are incorrect. "Incorrect Results".

@ "However, the cluster did not exhibit ..." Totally lost here; please revise.

* "higest"

* "load c[o]st is in fact larger"

@ "as it might stimulate the further development of current prototypes ..." But you are not comparing like with like in this paper! Comparing TPF/FedEx (prototypes) with Virtuoso and friends is comparing apples and oranges!

* "have a different impact [on] the systems"

* "This support[s] the advice"

@ "iLab.t provided the high memory infrastructure to compress the benchmark datasets" Is this matter discussed in the paper? If not it should be. (My guess is that this is for the HDT dictionary?)

Again I just wish to summarise that I am in general positive about this paper, but the writing needs significant improvement. The above list should be considered incomplete and more an example of the types of problems in the writing that made the paper difficult for me personally to read. My suggestion is to revise not only the above points, but the writing of the entire paper. I hope that the above examples will help to give an idea of the types of issues that need to be addressed.

[Just as a side remark on count queries ("However, no inconsistencies were found there."): in this paper, we found something similar: aside from 4store, if the engine returned a count, that count could be trusted.]

Review #2
Anonymous submitted on 23/Feb/2018
Major Revision
Review Comment:

The authors present an in-depth evaluation of different RDF storage systems.

The evaluation primarily uses WatDiv at 3 different scale levels and compares commercial triple stores, a FOSS triple store, and federated setups.
It is clear that a lot of work was put into the creation of these benchmarks, and the paper presents an up-to-date performance evaluation.
Additionally, the authors provide the benchmark results in a git repository, which allows for independent verification.

In general, I have the impression that the paper touches on too many aspects of benchmarking (vertical, horizontal, federated, caching, ...) without providing an in-depth discussion of the underlying causes. In many cases only aggregated observations are presented.

The main contribution of this paper is a run of WatDiv at 1B triples at optimal configuration.
In contrast to the previous version, this part was greatly improved.

The other runs/discussions all have issues, as follows:

# Major issues with the evaluation:

## Inconclusive answers to the research questions

RQ1: Reproducibility
You state that by using cloud services and either AMIs or Docker images you enable reproducibility.
I do not see the novelty in this approach, nor would I even call it a benchmark methodology, especially as the trade-offs of this approach are not considered.
With respect to your response: instead of the research groups using whatever is at the moment in their data center, Amazon is now using whatever is currently in its data center, sold under the moniker of the vCPU, the computational equivalent of which is not well defined.
This may however introduce variance, as for example discussed in [1].
Also, instances might vary in their CPU feature set, such as the supported AVX version.
Additionally, the configuration of the disk system seems to be the default, but what does that mean?
Can you expect similar results in the Google cloud?

Demanding publication of results and configuration is also state-of-the-art, as for example defined by BSBM [2].
Further, it is not clear how you achieved the order-of-magnitude increase in performance in GraphDB through the RFI.

As before, I would say that you follow established best practices, and for just executing a benchmark this would have been enough, but I do not see how this section lives up to the claim of presenting a benchmark methodology.

RQ2: Options and tradeoffs
I was expecting a nuanced, scenario-specific answer here; however, you just boil it down to cost, which is effectively runtime cost + licence cost. What are my options? What are the tradeoffs?

RQ3: Factors of query performance
The abstract mentions server load, availability, caching, and query completeness as factors. I find those less helpful than the ones given in Section 3.1.

Those factors are evaluated and discussed individually, and the selection of benchmarked scenarios is reasonable.

I highly recommend differentiating by query type for the results presented in Fig. 1/2/4, or at least examining which queries are especially sensitive to changes in the parameters.

I disagree that query runtime is an oversimplified representation of performance, as you state in your conclusion. Put into context, this is what constitutes a good benchmark. I suppose this is just imprecise wording.

RQ4: Transfer to the real world

I agree with your conclusion that synthetic results are not easily transferable.
Given a completely different data set with hugely different queries, I do not find this surprising.

## Benchmark dimensions (continuing RQ3)

### Dataset size and memory scaling
Dataset size and memory are related in a linear fashion: twice the amount of memory can hold twice the data; or at least, that could have been a hypothesis to check.
In the evaluation, however, memory is doubled while the data size is increased 10-fold.

I suppose the rationale behind scaling the benchmark along those dimensions is to show when non-linear effects kick in. To reduce complexity, I would recommend scaling only one of those dimensions, but at a finer granularity. Only examining the average/median might be misleading here, as query operators might also scale non-linearly.

Virtuoso was not really tested here, due to its compact memory layout.
Scaling of the other systems was only performed with non-optimal settings.

### Horizontal scaling

The benchmark essentially only tests Virtuoso, but not in a meaningful way.
ES and TPF have error/timeout rates so high that they basically fail to execute the benchmark.
For TPF this is not unexpected, as it already fails the 1B WatDiv benchmark in a similar way. By sharding the data, I would not expect a different result, as the client is most likely the bottleneck here, though this is not discussed in the benchmark.
You state (in the response letter) that this setup was of interest for Ontoforce. I fail to see why it should be a suitable candidate.

But also, as you state, the Virtuoso benchmark has the wrong setup, as the dataset size is too small.
So, as the others are failing anyway or running into timeout after timeout, you could very well have increased the dataset size.

In summary, this section contains no meaningful benchmark run.

### Configuration
The takeaway message is that Blazegraph and GraphDB are more susceptible to configuration than Virtuoso.

It is a pity that the dataset/memory scaling was performed using the default configuration.
I do not see a conclusive reason to test the non-optimally configured versions in four benchmarks each.

A missed opportunity here would have been to introduce, for example, a local SSD and see how it helps the spill-to-disk stores.
This would also have been interesting for the price/performance comparison.

## The caching section
I do not find conclusive evidence on caching in this section, and I disagree with its methodological approach.

If you want to benchmark the effects of caching in the sense that the same query is executed multiple times, the proper way is to execute both repeated and unique queries in different runs and compare them.

Fig. 7 and Fig. 8 compare the warm-up run with the stress run. Since in the warm-up run the buffers might (depending on the system) not yet be filled, a speed-up is expected; this is why you call it a warm-up, right?
If you refer to this as caching, please improve the naming or give a definition.

Also, from caching I would expect the relative performance gain to increase with query runtime, as caches usually allow constant-time retrieval, but it actually decreases with larger runtimes. With large gains only reported for fast queries, I would assume that this could be attributed to general query runtime variation.

Further: you write that TPF benefits most from caching due to NGINX. I find this hard to believe.
First, the correct way of benchmarking this would have been to disable caching and use the TPF server natively.
Second, TPF is most likely limited on the client side, where this form of caching should have no effect. Please closely evaluate where the bottleneck is.

For testing the effects of data locality (i.e. whether the data resides on the network, on disk, in RAM, or in a cache), I propose to repeatedly test (a) the same query, (b) the same query template, and (c) the same query type; a sketch of such a protocol is given below.
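A minimal sketch of such a protocol (purely illustrative; `run_query`, the endpoint, and the query lists are placeholders, not part of the authors' actual harness):

```python
import statistics
import time

def run_query(endpoint, query):
    """Placeholder: execute `query` against `endpoint` and return the elapsed time in seconds.
    A real harness would issue a SPARQL request here and wait for the full result set."""
    start = time.perf_counter()
    # ... send the query to the endpoint ...
    return time.perf_counter() - start

def locality_test(endpoint, query, template_instances, type_instances, repetitions=10):
    """Compare three repetition regimes to separate caching/data-locality effects
    from general runtime variance:
      (a) the exact same query (cache hits expected),
      (b) fresh instantiations of the same template (partial reuse),
      (c) different queries of the same type (little reuse)."""
    same_query = [run_query(endpoint, query) for _ in range(repetitions)]
    same_template = [run_query(endpoint, q) for q in template_instances[:repetitions]]
    same_type = [run_query(endpoint, q) for q in type_instances[:repetitions]]
    return {
        "same query": statistics.median(same_query),
        "same template": statistics.median(same_template),
        "same type": statistics.median(same_type),
    }
```

If caching were the dominant effect, the medians should differ markedly between (a) and (c); if they are similar, the observed speed-ups are more plausibly ordinary runtime variation.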

# Minor issues

## Discussion of the results

The discussion of the results focuses too much on the aggregated data. This is exemplified by Figure 5, which is not discussed at all and not referenced in the text. Figure 6, however, is an aggregated view of Figure 5 and is only shallowly discussed.

What are the properties of the queries that make them run slower on certain systems?

## Relation Watdiv / Ontoforce benchmark

I find that the OF benchmark results still give little insight to other researchers.

While it is certainly interesting to see which features of a query can cause the query to fail, this is much more interesting to the vendors, who, however, would need access to the queries and the dataset.

The release of the query set (easy to miss; please put it into the GitHub repository), in combination with the raw data, allows for some more detailed analysis; however, the raw presentation of the queries puts up high barriers to doing so.

## Vertical Scaling / Calculation of values presented in the diagram

In 4.2 you state that memory is no magic solution: Virt1_64_Def improves by a factor of ~1.5, which is not easy to see due to the log scale; I disagree with your statement there. It would be helpful to report total benchmark times as numbers, not just on a log-scale graph.

## Value calculation

I really tried to figure out how large the performance gain is, so I checked the raw data you published. It appears that Fig. 1/2/4 are calculated on the basis of the median values of the benchmark runs. Why is that so, and why was the raw data not simply used? As it stands, this seems to be the median/average of median values.
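To make the concern concrete, here is a small illustration (with made-up numbers, not values taken from the published data) of how aggregating per-query medians first can give a different picture than aggregating over the raw executions:

```python
import statistics

# Raw runtimes in seconds for two hypothetical queries, three repetitions each.
runs = {
    "Q1": [0.10, 0.12, 0.95],  # one slow outlier execution
    "Q2": [0.40, 0.42, 0.41],
}

# Aggregating per-query medians first hides the outlier entirely ...
median_of_medians = statistics.median(statistics.median(r) for r in runs.values())

# ... whereas aggregating over the raw executions keeps it in the distribution.
all_runs = [t for r in runs.values() for t in r]
median_of_raw = statistics.median(all_runs)
mean_of_raw = statistics.mean(all_runs)

print(median_of_medians, median_of_raw, mean_of_raw)  # approx. 0.265, 0.405, 0.40
```

Stating explicitly which of these aggregations underlies the figures would already resolve much of the confusion.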

## Raw data in the git hub repository
Following the links on your companion website, the raw data in the GitHub repository indicates that the benchmark was run using different repetition counts. While this might not be a problem, I find it odd and contradictory to the text. However, I might have just looked into the wrong files, as there seems to be some stale data (correct/rerun/rev1_13_....).

## Benchmark survivability

I disagree that a system "survives" a benchmark if ~80% of the queries time out, but your textual explanation is appreciated.
I just ask myself, with that in mind, what does Fig. 3 tell me?

Additional Notes:

Wikipedia links in the footnotes are unnecessary.

Figure 6: "BGP type" -> "WatDiv query template type"

Figure 8: I have a problem understanding the setup. Did you compare the warm-up with the stress test? You write that you compare the multi-threaded run with the stress test, but those are the same, right?
Further: the runtime axis has no unit; I assume seconds.

[1] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3, 1-2 (September 2010), 460-471.


Review #3
Anonymous submitted on 03/Mar/2018
Minor Revision
Review Comment:

The paper presents the results of an extensive query performance benchmark that was conducted for different available scalable RDF systems. The abundance of such systems, tools, benchmarks, etc., makes it very hard for domain engineers to choose an adequate configuration of systems and hardware.
In this work the authors focus on different hardware options and engine configurations, and on different approaches to Linked Data querying, which allows a comparison of the actual benchmark costs.
The analysis is performed both on artificial tests and on a real case with queries from a biomedical search application.

The analysis presented is very thorough; all configurations, parameters, optimizations, and other practical decisions are clearly stated and explained, in accordance with the authors' main goal of providing a reproducible benchmark. The manuscript is well written, self-explanatory, and easy to follow.
The authors propose at the beginning four research questions that they aim to answer with the benchmarking process.
These are very natural and important questions that need to be analyzed in order to provide engineers with a useful spectrum of the possibilities for setting up these types of complex systems.
Among the four questions, RQ3 ("What factors have an impact on the measured performance and is this impact consistent across different RDF solutions?") is for me one of the most important, and possibly the most oversimplified one in previous practical demonstrations of RDF systems. One very important conclusion the authors make (also quite natural and expected) is that query time is not the fairest measure that can be used. In many works query correctness and consistency are not studied, and clearly they can make a significant difference when choosing which system to use in practice.
Furthermore, the authors also released the set of scripts for deployment and for speeding up the post-processing of the presented results.

I think the work presented here is very important for the Semantic Web community and should be used as a starting point for how to produce (re)usable and really useful (in real-world applications) benchmarks. I list below some minor comments and typos that I would like the authors to address.

Particular comments and typos:

- Page 2, column 1, at the end: "an approach we fill further extend in this work." --> I think there is a typo there; the sentence does not seem to make sense otherwise.

- Page 2, column 2: "All benchmarks involving Virtuoso on the Ontoforce data have therefore been duplicated" --> Since it is the very first time these systems are mentioned, I think a reference should be added.

- Page 3, column 2, section 2: "...therefore reveals the ability of of the internal..." --> the word "of" is repeated.

- In several places the authors use "...". I think that in many cases "etc." would be a better option.

- Section 3.4: "Summarized, the dataset consists of 2.4 billion triples spanning 107 graphs. PubMed, ChEMBL, NCBI-Gene, DisGeNET, and EPO are the largest graphs with PubMed already making up 60% of the data." --> Are none of these graphs publicly available?