Reproducible Query Performance Assessment of Scalable RDF Storage Solutions

Tracking #: 1592-2804

Dieter De Witte
Laurens De Vocht
Jan Fostier
Filip Pattyn
Kenny Knecht
Hans Constandt
Ruben Verborgh
Erik Mannens

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Applications in the biomedical domain rely on Linked Data for an increasing number of use cases spanning multiple datasets. Choosing a strategy for running federated queries over Big Linked Data is however a challenging task. Given the abundance of Linked Data storage solutions and benchmarks, it is not straightforward to make an informed choice between platforms. This can be addressed by releasing an updated review of the state-of-the-art periodically and by providing tools and methods to make these more (easily) reproducible. Running a custom benchmark tailored to a specific use case becomes more feasible by simplifying deployment, configuration, and post-processing. In this work we provide a detailed overview of the query performance of scalable RDF storage solutions in different setups, with different hardware and different configurations. Tools to simplify the exploration make this work more easily extensible and renewable. We show that single-node triple stores benefit greatly from vertical scaling and proper configuration, but horizontal scalability is still a real challenge for most systems. Alternative solutions based on federation or compression still lag by an order of magnitude in terms of performance but nonetheless show encouraging results. Furthermore we demonstrate the need for query correctness assessment in benchmarks with challenging real-world queries. With this work we offer a reproducible methodology to facilitate comparison between existing and future query performance benchmarks.

Major Revision

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 08/Apr/2017
Major Revision
Review Comment:

Comprehensively benchmarking RDF stores is a complex task. One must consider different shapes of queries, different operators in the query language, different types of data, different scales of data, different stores, different versions of stores, different configurations of stores, different types of caching, different memory requirements, different types of hard disk, different types of CPU, different types of replication, different types of distribution, different types of data partitioning, different types of results returned, different types of timeouts and errors, different types of metrics to output, and all this just to start with. Is it humanly feasible to make sense of all this diversity? What regions of this high-dimensional "benchmarking space" are most interesting to examine? Can we prune some paths? How can we best summarise points in this space into core conclusions and two-dimensional plots? The one thing then that is perhaps more challenging than doing a comprehensive benchmark is writing a paper on one.

This paper deals with benchmarking various RDF stores under a variety of settings. The core of the benchmark revolves around testing Virtuoso, BlazeGraph, GraphDB and an undisclosed fourth enterprise store (ES) under various permutations of (and unwanted intrusions from) the above space described above. The benchmark revolves around two resources: the WatDiv synthetic benchmark designed to provide a comprehensive set of BGP shapes in the form of queries and an associated data generation engine, as well as a proprietary Ontoforce collection of datasets and associated queries generated by the interface of one of their faceted-search products. The queries involved appear challenging and the scale of data ranges upwards to two-point-something billion triples in the case of the Ontoforce resource. The authors describe a variety of experiments on a platform of rented Amazon EC2 hardware under various configurations of memory, machines and RDF stores, providing a variety of comparative metrics to distinguish the performance of different stores. The conclusions are wide and varied, but a trend of the superior performance of Virtuoso over other stores establishes itself early and continues throughout the paper.

There have been a lot of papers on benchmarking RDF stores and there could be a lot more. Such papers are of course crucial for the community: they are key for understanding the maturity of Semantic Web technology and the feasibility of applying such technologies in practical settings, and they are key to understanding open problems and finding the right research questions in terms of querying performance. This paper, more so than previous papers, makes an effort to explore more of the "benchmarking space". Unlike other papers I've seen, the authors look more practically at issues such as loading times, configurations, single-vs.-multi-node setups, benchmark costs on different hardware types, etc. These results provide rich insights, in particular, into the practical trade-offs and costs of using these technologies in "real settings", not just for running a query mix on whatever machine the researchers have available.

There are various positives about this paper I can highlight:

* The paper is clearly on topic for the special issue.

* The authors compare the performance of engines "out-of-the-box" (with default configurations) against configurations tuned by the vendors, which gives a great insight into not only the potential of these engines with some tweaking, but also into how "general" or flexible the optimisations implemented by different stores are.

* I also liked using varying machines on Amazon EC2, the presentation of financial costs (used for example in high-performance computing areas such as sorting, but not so much in the Semantic Web), and varying query loads. There is a certain practical/commercial focus throughout the paper that I like, perhaps as a result of this paper stemming from an industrial collaboration.

* The datasets selected offer a challenging scenario, with billions of triples, challenging queries, and a mix of more synthetic with more "real-world" queries.

* The sequence of performance questions and metrics that the authors ask (though not comprehensive -- they never could be) seems to me, for the most part, to be appropriate and natural.

* I can appreciate (from first-hand experience) the vast quantity of hard labour that must have gone into getting these results, in terms of installing engines, configuring them, babysitting experiments, checking into weird results, changing something, running the experiments again, and so forth.

* I also appreciate that the authors make a fair effort to critically assess their experiments and to investigate and discuss counter-intuitive results.

* Though admittedly I have not checked the details, the authors claim to make various resources available online for reuse by other groups.

* There is also a subtle point here but I like that the paper is less about how great the benchmark is and more about exploring the performance of the systems. Many benchmarking papers spend most of their space trying to argue why their benchmark is the best benchmark ever and then quickly present some results; the majority of this paper focuses on performance results.

My general impression is that this is one of the most informative benchmark papers I've seen from a practical performance perspective.

However, there are various significant weaknesses as well.

The first is rather philosophical:

* The paper still has a feeling of being "yet another benchmarking paper", by which I mean take some (usual) engines, vary some configurations, run some experiments, and present some results and interpretation. The main issue with such papers is that they can only explore a certain amount of the "benchmarking space" and really, we do not get much closer to a core abstract understanding of performance for the primary reason that the really interesting conclusions are buried under the labels "Virtuoso", "BlazeGraph", "GraphDB", etc., which is to say, by treating these engines as black boxes, we do not get much closer to understanding *why* these engines are faster, nor do we get any closer to knowing how to improve upon them. The main conclusion then, like the other benchmarking papers, is that unless you really need an open source engine, Virtuoso is the way to go. There are some interesting variations on this conclusion in this paper (e.g., is 32GB Virtuoso better than 64GB Virtuoso, and how important are the configurations, etc.) but I feel somehow so long as we run benchmarks considering such systems as black boxes, we're not really getting any closer to a real *understanding* of query performance or how far away the more poorly-performing systems are from the best-performing ones (e.g., maybe some minor optimisations could see them compete better?). All the same, the alternative is idealistic and challenging; I just wished perhaps for a little more *why* when interpreting the performance results in terms of what the underlying engines are doing. But to be clear, I am more than willing to accept a good "yet another benchmarking paper" and see a lot of potential in this one.

While I'm not sure what can be done about that weakness, I believe the following should be addressed before the paper can be accepted.

* I really like the complexity inherent in the paper in the sense that other benchmarking papers might be tempted to sweep some of the details fleshed out here under the carpet. I want to see the details. Unfortunately though, in the current state, I found the paper to be tough to read due to a certain disorder in presentation. At times, reading the paper felt like ...
"X performed best in these experiments with 32GB but Y came second so now we look at 64GB where figure 5 you saw two pages ago show the overall results (we didn't plot anything for 32GB even if you thought that this was where figure 5 was being discussed). Then we changed the configuration yet again and now Y performs best as you can see in the leftmost bars for the C queries but only in median results because if we consider the timeouts then X performs best, at least until we consider the optimised configurations. By the way, X implements a results timeout that Y doesn't, which explains an undetermined amount of why it performs better, but we'll only tell you that at the end of the paper."
Of course I am exaggerating but I mean for this comment to be constructive: the subjective experience of reading the paper, for me, was really like this. The paper needs to be thoroughly revised and structured so that the detail is preserved, but each new set of experiments follows a similar flow. I would suggest to have explicit sections/paragraphs and follow the same structure:
- What is the question we still need to answer that these new experiments address?
- Discuss the configuration (and new notation/legends as needed)
- Provide a plot and reference it (avoid discussing results for three paragraphs without a plot ... you need to give an overall picture and you have the space)
- Provide high-level observations for the plot (to sort of explain/confirm what the reader can see on a high-level) and high-level interpretation
- Extract interesting specific/new conclusions, possibly in a structured itemized list, with interpretation
Also the authors should consider the order of presentation of "issues". See the next point.

* There were a couple of issues that are core issues in benchmarking but that are not (in my opinion) clearly addressed in the current structure.
- The first is caching. Reading the first few parts of the paper, there are mentions here and there, but it is not addressed in all cases, or at least when reading the paper I found myself asking, what about caching? For each experiment, I'd like to see a quick discussion on possible caching effects, even if just to say "we believe that caching cannot have an effect".
- The second is the more serious question of the correctness of results. This is mentioned in a few places in the paper in an abstract manner and I presumed that for all experiments, the result sizes (or the result itself, in the case of COUNT) were compared across different engines. However, at the end of the paper, it is revealed that Virtuoso had the default result limit of 100000 enabled and even after configuration, had an inherent limit of 2^20. This calls into question what results presented previously were affected by this and to what extent? I know it sucks to run all the experiments and only then find out this was an issue, but as it stands, the reader has no guarantee that the previous comparisons are "fair" or not, or to what extent Virtuoso is faster only because it didn't bother to return all the results. Just as a concrete example, in Figure 1 we can see that the performance of the optimised 64GB version *degrades* over the default 64GB version, which I had circled with a big question mark. Now I know why: the former has a result limit of one million while the latter has a limit ten times less. So now if I look at Figure 1 again, with this in mind, I have *no* idea how much the expert configuration of Virtuoso improves performance; furthermore, I have no real evidence as to how Virtuoso compares with other engines which (presumably?) returned all results or failed.
Such issues need to be dealt with in a timely manner, not at the end of the paper. Also, the issue of result correctness needs to be thoroughly revised and addressed: how much does it influence previous results? If the interpretation of those results is gravely affected, unfortunately I see no other option than to rerun the experiments for Virtuoso (and whatever other engine is affected). If the interpretation of results is not gravely affected, the issue needs to be addressed upfront and, for each experiment, the number of queries/results affected needs to be presented, and an argument given as to why the performance results are still valid. I commend the authors for raising the issue, but unfortunately it needs to be properly addressed.
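(As an aside, such a cross-engine sanity check need not be expensive. A hypothetical Python sketch, with invented per-query result counts, could simply flag every query whose counts disagree across engines or coincide suspiciously with a known truncation limit:)

```python
def flag_count_mismatches(results):
    """results: {query_id: {engine: result_count}}. Flags queries whose
    engines disagree on result size, or whose count equals a known
    truncation limit (Virtuoso's default 100000 or its hard 2^20 cap)."""
    LIMITS = {100_000, 2**20}
    suspicious = {}
    for qid, counts in results.items():
        vals = set(counts.values())
        if len(vals) > 1 or vals & LIMITS:
            suspicious[qid] = counts
    return suspicious

# Invented counts: C3 hits Virtuoso's 2^20 cap, L1 agrees everywhere.
demo = {
    "C3": {"virtuoso": 2**20, "blazegraph": 41_943_040},
    "L1": {"virtuoso": 523, "blazegraph": 523},
}
print(flag_count_mismatches(demo))  # only C3 is flagged
```

Running such a check per experiment and reporting the flagged queries would give the reader the fairness guarantee that is currently missing.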

* The paper puts the "reproducibility" of results first and foremost, but a lot of the results rely on a proprietary dataset and set of queries that are not made public. I understand the reasons behind this and perhaps there's nothing to be done about it, but I still feel it is a weakness worth mentioning.

* I found the presentation of TPF and FluidOps results to be sort of a square peg being shoved into a round hole. To be honest, I don't know specifically what the FluidOps configuration is doing. As for TPF (I don't need to explain this to the authors of course but ...) the whole reason for TPF is as an alternative to SPARQL in decentralised settings where clients are supposed to get their hands dirty, which is almost precisely the opposite of the current scenario. The provided justification of "let's test compression" is not convincing given that clearly all the other stores are also using various forms of compression. Also on a conceptual level, I do not like that people involved with TPF are involved in the experiments; it doesn't seem so independent or fair in that more expertise is on-hand in terms of installation and configuration than for other setups. The results are sort of interesting, but at the same time do not fit well in the paper. I think the authors need to either (i) keep the results but justify them better in the context of the paper and declare clearly that it's not really a third-party system, or (ii) remove the results.

* Finally, the conclusions are quite brief and a bit weak. Given all the results presented, I would like to see a little more summarised here in terms of take-home messages from the paper.

In summary, there's a lot I like about the paper and I fully understand the challenges associated with designing and executing such experiments. I believe the authors have done most of the hard work needed to realise a very nice paper discussing practical issues of performance that I would recommend others to read. However, before I could recommend it (or before I could recommend accepting it):

* The authors need to rewrite the paper and improve the structure and flow, particularly in the presentation of results. I provide some suggestions above I hope will be useful for that. Note that I like the level of detail, I'm just suggesting to apply more structure to the detailed discussion so the reader doesn't get lost.

* The authors need to better address issues of caching and query correctness in a more timely manner, possibly even for each experiment. In particular, the problems mentioned with Virtuoso may unfortunately require rerunning certain experiments (the results are currently incomparable without such a revision).

* The authors need to either remove the TPF et al. results or better justify their inclusion (and provide a disclaimer that it is not a third-party system).

* Extend the conclusions so that they capture more aspects of the paper.

On a side note, the changes required are significant enough that the revisions will need to be thorough to convince me to accept next time and I am not sure I've captured all comments since there's still many details I am confused about in the current presentation.


* The English is generally readable but there are a lot of typos that a spell-check will find (that I thus won't list).

Section 1:
* "This is a challenge" I don't like starting a section with a pronoun. What does "This" refer to?
* "when configured optimally?" Nit-picking but how can you know it's configured "optimally"?
* "to study scalable RDF systems" ... "[only] the enterprise systems" It's a strange contrast.
* "set weights [for]"
* "the full work"?
* "benchmark cost" What does this mean?
* "features of FEASIBLE" What is FEASIBLE, specifically?

Section 2:
* I think Table 1, while nice, could be cleaned up a little in terms of formatting and spelling (particularly the Remarks column); also I think maybe it's important somehow to emphasise the "recent" part a little more in that I was wondering, where is DBPSB or Berlin or LUBM, etc.? (Also, our workshop paper (Hernandez et al.) used 14 queries, but the conference paper cited in the table uses a query generation method.)
* "non-conjunctive" Not the best name since the queries are still conjunctive.

Section 3:
* "choice [of]"
* I think when introducing Table 3, it's worth discussing a little the reasons for not disclosing ES, maybe even in a footnote: Did you ask? Did they refuse? Would they sue you?
* ",..." Put ", etc." in every such instance.
* The paragraphs "To simplify the deployment ... research topic was created.": I found these hard to follow. Are the details really important in the paper? Do they affect the conclusions? Could they be better explained or summarised? Maybe the details can be put on a webpage or appendix to not interrupt the flow?
* When introducing the datasets and queries, I would like to see an example or two.

Section 4:
* "[While] Virtuoso was clear[ly] first for ..." I don't like having such discussion without a plot to refer to. The picture painted is partial and hard to visualise; for example, I cannot really put "1ms" in context without a plot. Every discussion like this, unless very minor, should have a plot to refer to.
* "The runtimes for WatDiv1000M are shown in the leftmost boxplots of Fig. 1." But they're all WatDiv1000 results? You mean for the 32GB experiments? (Here I will stop giving detailed comments on such points; the authors just need to rewrite these sections and improve the structure. What results refer to what parts of what plots is currently unclear in the current presentation and it's a constant struggle to not get lost.)
* I found Figure 4 hard to follow in terms of what "benchmark survival" means, why Flu3 is tested for 100M while Flu1 is tested for 1000M, what the 1 and 3 mean, what the Warmup/Stress thing means, etc. Most of the required info is in the paper somewhere, just in various different places, or hidden (for example) in between the two plots. Also why does Flu3 fail when Flu1 has no problem?
* "the documentation is very extensive" I don't follow the relevance (especially if Virtuoso provided the optimised configs).
* "HA-solution" What's that? (High Availability, I guess, but it needs to be explained.)
* "This observation, together with ..." I don't follow. Also don't you run additional benchmarks with that dataset?
* "we can conclude that both for compression and federation the Triple Pattern Fragments approach" I think you're being too nice. The fact that the system can cleanly time-out on most queries is not really a cause for celebration (well maybe a little).
* "a very high correlation of 0.88" What is this correlation precisely? Kendall's tau? Spearman's rho? How was it computed?
* "The instance cost of the AWS hardware ..." Are these hourly prices? Also it's more common to put the dollar sign before, e.g., $5.01.
* "C-templates" what are these? Complex templates?
* Figure 6 is a headache. There is no progression across series so the lines do not help. Two of the series are very very very slightly different shades of blue. Suggestion: use grouped bar charts (grouped by query) with distinctive formatting on each series; I think the bars can easily fit and it will be much clearer.
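(On the correlation question raised a few points above: the choice of coefficient matters, because rank correlations are far more robust to runtime outliers than Pearson's r. A stdlib-only sketch of Spearman's rho, one plausible candidate, on invented per-query runtimes:)

```python
def ranks(xs):
    """1-based ranks with average ranks for tie groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the rank vectors
    return pearson(ranks(a), ranks(b))

# Invented runtimes on two stores; the 120 s outlier wrecks Pearson's r
# far more than Spearman's rho, so the coefficient must be reported.
store_a = [0.1, 0.2, 0.4, 0.8, 120.0]
store_b = [0.2, 0.3, 0.5, 1.0, 2.0]
print(round(spearman(store_a, store_b), 2))  # 1.0: identical rank order
```

Whichever coefficient the authors used, naming it and the computation would make the 0.88 figure interpretable.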

* A little clean-up required: some encoding issues and HTML entities visible.

[Disclaimer: I have a minor CoI since I was involved in initial discussions with Ontoforce in the context of an industrial collaboration on a similar topic but I don't feel this affects my objectivity.]

Review #2
By Uroš Milošević submitted on 24/Apr/2017
Minor Revision
Review Comment:

The paper addresses the different options and the associated trade-offs when choosing a linked data infrastructure setup in the context of Big Linked Data, the consistency of different RDF storage systems across both artificial and real-world queries, and the questions of reliable and reproducible query performance benchmarking. The review of state-of-the-art triple stores covers four commercial and three alternative systems, and investigates scalability both in terms of dataset sizes and hardware configurations (horizontal and vertical), with a focus on the domain of Life Sciences.

The article builds on top of earlier work, but also provides a rather extensive overview of related efforts and appears to be one of the most extensive recent research papers covering the topic.

The commercial RDF store selection process is not convincing, as the end result does not appear to be fully in line with the described criteria. The inclusion of an "undisclosed enterprise store" is not a problem per se, but at least some background information should have been provided. The inclusion of "other" RDF stores, however, is commendable.

The text appears to imply that the authors used Virtuoso Open Source for the single-node tests, and the commercial version of the same triple store for the multi-node benchmarks. It is not entirely clear why the commercial edition was not used in both tests. Also, even though only a 3-node configuration was supported by one of the systems, horizontal scaling tests for the other stores could have gone beyond 3 nodes.

Overall quality of writing is good. The paper is well structured and easy to read. Some minor improvements are needed:
- Acronym for Basic Graph Patterns (BGP, page 3) is introduced before the actual explanation (page 9);
- Not all table headers are accompanied by clear references in the text (or captions);
- Commas are often being left out.

Review #3
Anonymous submitted on 13/Aug/2017
Review Comment:

The paper describes a benchmarking effort on several commercial and open-source triple stores in single-node and cluster setups, as well as on systems deployed in a federated setup.
The authors continue the efforts of two previous papers on evaluating WatDiv on cloud instances and on the evaluation of the Ontoforce benchmark.

The paper is motivated by the difficulty of finding the right triple store for a certain scenario. The related work section embeds the approach well among its peers. The rest of the paper discusses the setup and results of a series of WatDiv and Ontoforce benchmark runs.

I would therefore summarize the main contributions as follows: (a) an evaluation of WatDiv on multiple systems and scenarios, (b) a discussion of the results from different viewpoints, (c) the publication of the benchmark scripts and the extension of a benchmark runner, and (d) an evaluation with a closed, company-internal query set and dataset as a check of whether the results from the synthetic dataset can be transferred.

In terms of originality, the systems, concepts, and criteria presented in this paper can be found in other works, as the authors themselves detail in their related work section.
By combining those aspects of other systems and approaches they create something new, and I find the work to be sufficiently original.

The quality of writing and the presentation of the results should be improved before publication; besides issues in the references, typos, and occasional grammatical errors, I find the general tone of the paper to be too informal. But I found nothing that could not be improved with minor effort.

Reading and analyzing the paper, quite a lot of issues came up; I doubt that the paper can be published as it is, and I think that it requires a big effort in order to become publishable.

Main Issues

Dealing with query failures and timeouts
One of my main critiques of this paper is how incomplete results, timeouts, and errors are dealt with. They seem to be incorporated into the results and graphs without clear marking.

Virtuoso, for example, is the fastest store by a big margin. However, with C3, Virtuoso only retrieves 1/40 of the results; BlazeGraph, by contrast, retrieves the full result but takes 20 times as long.
I can only guess here: for BlazeGraph, C3 is by far the most expensive query, so assuming Virtuoso and BlazeGraph would both skip this query, or Virtuoso were forced to paginate over the results, the summary results would be significantly different.
While you state this is "not entirely fair", I have the impression you underestimate this issue.

Further, the real-world dataset seems to have so many errors and issues that I wonder how you can make any statement other than "Virtuoso runs in time x, the rest fail".
Please also make clear why certain queries fail, and what can be done to achieve at least a little comparability.
I think this should be made explicit, especially as you want to establish a methodology for reproducible benchmarks.

The same applies to the Flu and TPF comparisons.

Contribution: Methodology unclear
The paper states in the introduction and contribution chapters that it presents a methodology. While the authors' approach is structured, I do not see a specific method behind it.
Please make your method (if there is one) more prominent by giving it a name and introducing it properly. Furthermore, if the methodology is your main contribution, it should be demonstrated on more than one use case/benchmark.

You also state that you present guidelines, but I found none explicitly mentioned.
With respect to the issues I see in this paper, I suggest explicitly defining what to do with query failures, which results are the most relevant, and how they should be presented.

Research Questions and their answers
In general I find RQ1 to be sufficiently addressed, and a welcome response to the lack of an executable environment for WatDiv.

I find RQ2 to be the most relevant question of this paper. However, I did not find it sufficiently answered here.
Running a scaling experiment with just two different values (32/64 GB) is not enough, especially given Virtuoso's memory requirements (according to support). Finding out when linear or non-linear scaling occurs will require more than one data point.
The different scales at which the benchmarks were executed play only a small role in the paper.

Further, RQ3, the comparison with the real-world dataset, is lacking.
Without knowing which query types cause SPARQL endpoints to stop responding and which query types do not work, this part of the comparison is of little value to outside researchers.

Ontoforce Benchmark
I have serious concerns about whether the results gathered from the Ontoforce dataset are of any value to outside researchers.
Both the dataset and the queries are closed, and the queries are only vaguely described.
In a paper specifically dedicated to such an analysis, as you provide in [12], this is perfectly fine; however, it is diametrically opposed to your claim of a "Reproducible Query Performance Assessment".

Further: the queries seem to have extremely long runtimes, yet they are generated by faceted browsing. How can that be? Does the user tolerate a 1200 s timeout for their queries? Is this really an interactive session?

Also, you want to examine whether results from the synthetic benchmark can be transferred to the real-world equivalent.
The benchmarks seem to have nothing in common; why was no more closely related benchmark chosen, one whose query features and dataset properties are more similar?

The timeout of 300 s is too short. C3 with BlazeGraph (the only system to fully answer that query) has an execution time just below 300 s; this provokes timeouts and failures.

Use of Median
In most of the paper, median results are used to discuss performance gains. In the end, however, cost is presented as the measure.
This leads to strange observations: Gra_32 is faster than Gra_64 in total, whereas the median improves. Or does this have to do with other errors?
You basically state yourself that for ETL runs the median is meaningless, yet you use it nevertheless. Why not a simple average?

As you propose a methodology, this should be strictly defined.
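(To illustrate why the choice of aggregate must be pinned down: on invented runtimes where two queries hit the timeout, the median barely moves while the mean is dominated by the timeouts. Which behaviour is desirable depends on the use case, which is exactly why the methodology should fix one.)

```python
import statistics

# Invented runtimes in seconds; two queries hit the 300 s timeout.
runtimes = [0.4, 0.7, 1.1, 2.5, 3.0, 300.0, 300.0]

print(f"median: {statistics.median(runtimes):.1f} s")  # barely notices the timeouts
print(f"mean:   {statistics.mean(runtimes):.1f} s")    # dominated by the two timeouts
```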

Missing numbers
While the numerical results exist in the GitHub repo, most of the information has to be extracted from the graphs in order to draw one's own conclusions.
Please present basic statistics such as overall runtime, the number of errors, and the like for all benchmark runs.

Again, as you propose a methodology, it should be clear which values should be presented.

The federated Query Setup
While I applaud the effort to include a federation setup, sharding on the subject resource will let Flu and TPF execute more or less the same retrieval operations.
This setup differs from other benchmarking efforts, like FedBench, which shard vertically.
Therefore, this benchmark assesses the client's ability to retrieve data and execute SPARQL queries in memory; aspects like source selection and pushing down joins do not play a role in a horizontal sharding setup.
The implications for the setup and the expected performance should be made clear.
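(To make the sharding point concrete, a toy sketch with an invented shard function and invented triples: under subject-based hashing, every triple of a given subject lands on a single shard, so star-shaped retrieval around a subject never crosses shard boundaries and source selection degenerates to asking every shard.)

```python
from zlib import crc32

def shard_of(subject: str, n_shards: int) -> int:
    """Horizontal (subject-based) sharding: a deterministic shard per subject."""
    return crc32(subject.encode()) % n_shards

# Invented triples: both triples of ex:drug1 necessarily map to one shard,
# so a star join around ex:drug1 is answered entirely by that shard.
triples = [
    ("ex:drug1", "ex:treats", "ex:disease7"),
    ("ex:drug1", "ex:name", '"aspirin"'),
    ("ex:drug2", "ex:treats", "ex:disease7"),
]
shards = {s: shard_of(s, 3) for s, _, _ in triples}
print(shards)
```

A vertically sharded setup (per predicate or per source, as in FedBench) would instead force the federation engine to select sources and push joins, which is the behaviour this benchmark cannot observe.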

I do not see how compressing the dataset representation has an influence on the memory resources; this should depend on the implementation of the SPARQL store.
To show a difference here, Jena TDB should be included. To the best of my knowledge both TDB and Virtuoso use dictionaries, so they compress too.

Further: compression is presented as an aspect of vertical scaling. With SPARQL over TPF the join operations are executed on the client, which has no relation to HDT; I do not see a connection here.

Benchmark survival
With 71% of the queries failing, I do not agree with your conclusion that TPF survives the benchmark.
When comparing Flu and TPF, in both cases the source systems are stable and the client execution fails.
I would describe the TPF client as failing more gracefully than Flu.
This is only about the survival figure; for its implications on the results, see the section on dealing with failures above.

Minor Issues
Metrics used
Presenting the benchmark runtime cost is a good way of comparing RDF systems.
For the ETL use case of [13] this works well. In the case of the interactive Ontoforce queries, which were created during interactive database sessions, it is acceptable but can be improved: a more interesting metric would be the cost associated with keeping the query response time for a certain number of clients within a certain limit.
This would also shift the focus towards median times, instead of averages or total runtimes.
It would also allow a more nuanced answer to challenges ii and iii, as two load types are supported.

Mismatch between Abstract/Introduction and the Rest of the paper
Most of the paper discusses WatDiv and its evaluation; however, the paper starts with federated use cases over multiple datasets and the difficulty of finding the right triple store.

You also mention a feature matrix to overcome one of the challenges you raise at the beginning. While I think this is a good idea, it also gives away ES: there is an inherent conflict between making the selection process transparent and hiding the name of one of the systems.

Why is every query executed 5 times during the parallel run, instead of running individual queries?
While I think it is highly probable that this has no effect on the runtimes, I wonder why take the chance.

The figure captions give a short interpretation of the graph, while the referencing text gives the explanation. If a bar representing query execution speed is presented, I can immediately see from the graph which system is fastest. Please provide the context in the caption, i.e., what kind of run it is, the configuration, and how the data points are aggregated.

Fig. 6: please use swimlanes for the query types.
Fig. 2: the coloring scheme is not well chosen, with red indicating "error" and green indicating "good" while referring to different benchmark runs.

Figs. 3/4 should be presented in a similar fashion, as they present closely related data (linear vs. log scale; queries/second vs. seconds/query); I find it difficult to relate the two charts to each other.

Some references & the ontoforce queries
The use of references sometimes seems arbitrary. For example, [31] specifically describes optimizations of queries. As readers, we do not know the queries and therefore do not know which optimizations are not taken into account. Do you mean specifically those?
The same applies to "faceted browsing" [17,35].

What is an ETL run? You motivate your paper differently than [12]; please introduce the term properly.

Reusable Benchmark Component
This chapter contains a lot of implementation details and can be shortened.

What is a deeply nested query? Subqueries?
Table 4: what do "reference" and "vertical scaling" mean?

Sometimes the tone is a little too informal:
- "... queries goes from 0.15% to ..."
- "small", "big"
- exclamation marks

Formal Issues

Scenario naming: establish a consistent scheme, e.g., Default capitalized, in italics, etc.
The last paragraph of the vertical scaling section is hard to read.

Check your bibliography! At least the following entries have obvious issues: 7, 9, 22, 31, 33.

Do not use omissions ("...").

Join types: I think this term is a bit overloaded (hash join, left join, or snowflake?).

Section 3.5: "choice can be chosen".