qEndpoint: A Novel Triple Store Architecture for Large RDF Graphs

Tracking #: 3616-4830

Antoine Willerval
Dennis Diefenbach
Angela Bonifati

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract:
In the relational database realm, there has been a shift towards novel hybrid database architectures combining the properties of transaction processing (OLTP) and analytical processing (OLAP). OLTP workloads are made up of read and write operations on a small number of rows and are typically addressed by indexes such as B+trees. On the other hand, OLAP workloads consist of large read operations that scan larger portions of the dataset. To address both workloads, some databases have introduced an architecture using a buffer or delta partition. Specifically, changes are accumulated in a write-optimized delta partition while the rest of the data is compressed in the read-optimized main partition. Periodically, the delta storage is merged into the main partition. In this paper we investigate for the first time how this architecture can be implemented for RDF graphs and how it behaves. We describe in detail the indexing structures one can use for each partition, the merge process, as well as the transactional management. We study the performance of our triple store, which we call qEndpoint, over two popular benchmarks, the Berlin SPARQL Benchmark (BSBM) and the recent Wikidata Benchmark (WDBench). We also study how it compares against other public Wikidata endpoints. This allows us to study the behavior of the triple store for different workloads, as well as the scalability over large RDF graphs. The results show that, compared to the baselines, our triple store allows for improved indexing times, better response time for some queries, higher insert and delete rates, and low disk and memory footprints, making it ideal for storing and serving large Knowledge Graphs.
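The main/delta architecture described in the abstract can be illustrated with a minimal sketch. This is not qEndpoint's actual code (which builds on HDT and the RDF4J native store); the class, its names, and the merge threshold are all hypothetical, chosen only to show how reads consult both partitions while writes accumulate in the delta until a periodic merge:

```python
from bisect import bisect_left

class HybridTripleStore:
    """Toy model of a main/delta triple store: a sorted, read-only main
    partition (stand-in for a compressed index like HDT) plus a small
    write-optimized delta that is periodically merged into the main."""

    MERGE_THRESHOLD = 4  # hypothetical: merge once the delta holds this many changes

    def __init__(self, triples=()):
        self.main = sorted(triples)  # read-optimized main partition
        self.adds = set()            # delta: triples inserted since last merge
        self.dels = set()            # delta: triples deleted from the main partition

    def _in_main(self, t):
        i = bisect_left(self.main, t)  # binary search on the sorted main
        return i < len(self.main) and self.main[i] == t

    def insert(self, t):
        self.dels.discard(t)           # re-inserting a deleted triple cancels the delete
        if not self._in_main(t):
            self.adds.add(t)
        self._maybe_merge()

    def delete(self, t):
        self.adds.discard(t)           # deleting a pending insert cancels it
        if self._in_main(t):
            self.dels.add(t)
        self._maybe_merge()

    def contains(self, t):
        # Reads consult both partitions; the delta overrides the main.
        return t in self.adds or (self._in_main(t) and t not in self.dels)

    def scan(self):
        # Analytical reads stream the merged view of both partitions.
        for t in self.main:
            if t not in self.dels:
                yield t
        yield from sorted(self.adds)

    def _maybe_merge(self):
        # Periodic merge: fold the delta into a rebuilt main partition.
        if len(self.adds) + len(self.dels) >= self.MERGE_THRESHOLD:
            self.main = sorted((set(self.main) - self.dels) | self.adds)
            self.adds.clear()
            self.dels.clear()
```

Small writes touch only the delta sets, while the bulk of the data stays in the compact, immutable main partition; the merge cost is paid in batches rather than on every update.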

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 23/Feb/2024
Review Comment:

This paper introduces qEndpoint: a triple store architecture where a main data partition optimised for reads is accompanied by a data partition optimised for writes. The write partition contains the delta and is merged into the main partition whenever it becomes too large and inefficient. The self-indexed, compressed, read-optimised HDT format by Fernández et al. is used for the main partition, and the RDF4J native store is used for the write-optimised partition.

The revision fixes many of the issues I had with the paper:
- The authors added many clarifications and background to the paper. This significantly improves the readability for those who are not that familiar with the subject matter. However, I feel like the authors could have gone a little further. Also, some of the new (and old) text is sloppy and contains typos or spelling mistakes. I recommend carefully revising the language for the final version.
- There are references to OSTRICH and X-RDF-X. Also, the introduction to SAP HANA improved. With respect to OSTRICH and X-RDF-X, though, I think what's being discussed is beside the point (e.g., only triple patterns or being unmaintained). More interesting would be details on the architectural similarities.
- There is an additional experiment on merge time and a note on benchmarking, which motivates the choice of the Berlin benchmark a lot better. That said, I still believe the choice of benchmarks is rather limited. WatDiv would have been interesting to see how triple pattern performance evolves with the size of the data/merge partition. With respect to merges: I wonder why the authors did not consider https://aic.ai.wu.ac.at/qadlod/bear.html ?

To summarize: I'm a bit disappointed to see that the authors focused on changing as little as possible. That being said, I think the paper was sufficiently improved to be published.

Review #2
Anonymous submitted on 20/Mar/2024
Review Comment:

It looks like my previous concerns/questions have been clarified, and the missing discussion has been added.