Reproducible Query Performance Assessment of Scalable RDF Storage Solutions

Tracking #: 1592-2804

This paper is currently under review
Dieter De Witte
Laurens De Vocht
Jan Fostier
Filip Pattyn
Kenny Knecht
Hans Constandt
Ruben Verborgh
Erik Mannens

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
Applications in the biomedical domain rely on Linked Data for an increasing number of use cases spanning multiple datasets. Choosing a strategy for running federated queries over Big Linked Data is, however, a challenging task. Given the abundance of Linked Data storage solutions and benchmarks, it is not straightforward to make an informed choice between platforms. This can be addressed by periodically releasing an updated review of the state of the art and by providing tools and methods that make such reviews more easily reproducible. Simplifying deployment, configuration, and post-processing makes it more feasible to run a custom benchmark tailored to a specific use case. In this work, we provide a detailed overview of the query performance of scalable RDF storage solutions in different setups, with different hardware and different configurations. Tools that simplify the exploration make this work more easily extensible and renewable. We show that single-node triple stores benefit greatly from vertical scaling and proper configuration, but that horizontal scalability is still a real challenge for most systems. Alternative solutions based on federation or compression still lag by an order of magnitude in terms of performance but nonetheless show encouraging results. Furthermore, we demonstrate the need for query correctness assessment in benchmarks with challenging real-world queries. With this work, we offer a reproducible methodology to facilitate comparison between existing and future query performance benchmarks.
Full PDF Version: 
Under Review