qEndpoint: A Novel Triple Store Architecture for Large RDF Graphs

Tracking #: 3507-4721

Authors: 
Antoine Willerval
Dennis Diefenbach
Angela Bonifati

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
In the relational database realm, there has been a shift towards novel hybrid database architectures combining the properties of transaction processing (OLTP) and analytical processing (OLAP). OLTP workloads are made up of read and write operations on a small number of rows and are typically addressed by indexes such as B+trees. On the other side, OLAP workloads consist of large read operations that scan bigger parts of the dataset. To address both workloads, some databases introduced an architecture called differential updates. Precisely, changes are accumulated in a write-optimized delta partition while the rest of the data is compressed in the read-optimized main partition. Periodically, the delta storage is merged into the main partition. In this paper we investigate for the first time how this architecture can be implemented and behaves for graphs. We describe in detail the indexing structures one can use for each partition, the merge process, as well as the transactional management. We study the performance of our triple store, which we call qEndpoint, over two popular benchmarks, the Berlin SPARQL Benchmark (BSBM) and the recent Wikidata Benchmark (WDBench). We also study how it compares against other public Wikidata endpoints. This allows us to study the behavior of the triple store for different workloads, as well as the scalability over large graphs. The results show that, compared to the baselines, our triple store allows for improved indexing times, better response time for some queries, higher insert and delete rates, and low disk and memory footprints, making it ideal to store and serve large Knowledge Graphs.
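
A minimal, hypothetical sketch of the differential-update pattern described in the abstract (plain Java collections stand in for the real index structures; the class name, method names and merge threshold are illustrative only and are not taken from the qEndpoint code):

import java.util.HashSet;
import java.util.Set;

// Differential updates: writes accumulate in a small delta partition,
// reads are served against a read-optimized main partition, and the delta
// is periodically folded back into the main partition.
final class DifferentialUpdateSketch {
    private Set<String> main = new HashSet<>();                // read-optimized main partition
    private final Set<String> deltaAdds = new HashSet<>();     // write-optimized delta: insertions
    private final Set<String> deltaDeletes = new HashSet<>();  // write-optimized delta: deletions of main triples
    private final int mergeThreshold;

    DifferentialUpdateSketch(int mergeThreshold) { this.mergeThreshold = mergeThreshold; }

    void insert(String triple) {
        deltaDeletes.remove(triple);
        if (!main.contains(triple)) deltaAdds.add(triple);
        maybeMerge();
    }

    void delete(String triple) {
        deltaAdds.remove(triple);
        if (main.contains(triple)) deltaDeletes.add(triple);
        maybeMerge();
    }

    // Periodic merge: rebuild the main partition as (main - deletes) + adds,
    // then reset the delta so that it stays small.
    private void maybeMerge() {
        if (deltaAdds.size() + deltaDeletes.size() < mergeThreshold) return;
        Set<String> newMain = new HashSet<>(main);
        newMain.removeAll(deltaDeletes);
        newMain.addAll(deltaAdds);
        main = newMain;
        deltaAdds.clear();
        deltaDeletes.clear();
    }
}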

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Aug/2023
Suggestion:
Major Revision
Review Comment:

The authors present a graph database based on principles from the relational database world.
They claim to leverage differential updates, with reference to the database systems SAP HANA and C-Store.
Unfortunately, the authors do not explain what exactly they mean by "differential updates".
HANA and C-Store are both column-oriented databases, but neither explicitly uses the term "differential updates" for its architecture, despite the common delta strategy for updates on the database. Therefore, the article would greatly benefit from a basic explanation of which RDBMS strategy they adopt for the graph database.

The description of the qEndpoint architecture is well motivated, using HDT for the main partition and RDF4J for the delta partition.
Select, Insert and Delete are described as trivial, but concurrent transactions are not discussed. This should be included in a revised version.
Apparently, the critical part is the merge operation, as the delta partition is not able to scale. In addition to the already published HDTCat, they introduce HDTDiff, which seems to be a natural derivative of the previous work.

A large part of the article is taken up by the experiments section. The authors use two different benchmarks:
Berlin SPARQL and WDBench. Unfortunately, it is not discussed why these were chosen - there are others available, e.g. the DBpedia SPARQL Benchmark.
Also, the authors only describe general statistics of the benchmarks but not specific characteristics, such as the complexity of the queries (which might explain the high number of timeout errors?).

The proposed architecture seems to be very fast in both data loading and query execution. That said, it has to be admitted that the comparison is not entirely fair with regard to the research focus of some of the competing systems. For instance, QLever is a graph database specialized in text retrieval.
But overall the experiments do look impressive, especially the small size of the index.
There is one aspect that the authors do not discuss but that seems to be severe: the high number of
timeouts of qEndpoint on the WDBench dataset. Which queries are affected?
Why is their number so significantly higher compared to the competing databases?

While the motivation and focus of the proposed architecture seem valid, the article itself
is premature and has some parts which seem to be adopted from previous (rejected?) submissions,
e.g. Section 6.4: What demonstration are they referring to?
The article has several missing words and commas and requires at least a major revision before publication, quite apart from the flaws described above.
The authors did not mention it explicitly, but apparently their work is motivated by the necessity of fast queries on large graphs for QA scenarios. KGQA does require similarity searches, and a B+tree is not sufficient for proximity searches or a similarity function such as the ones included in PostgreSQL (to name a relational DBMS). It would be interesting to know which other index strategies could be included in their architecture besides the B+tree.

Review #2
Anonymous submitted on 22/Sep/2023
Suggestion:
Major Revision
Review Comment:

Overview:
This paper presents qEndpoint, a triple store that implements a differential-update architecture for graph databases by combining a read-optimized main partition (HDT) and a write-optimized delta partition (RDF4J). The performance of the triple store is evaluated over the BSBM and WDBench benchmarks. The experimental studies demonstrate that the combination of HDT and the delta is efficient and does not introduce an overhead. Furthermore, when compared to other existing systems, qEndpoint is overall comparable with other endpoints while reducing memory and disk footprint.

(1) Originality: This work is an extension of two previously published papers. Compared to previous work, two additional evaluations are added to this study. These augmentations are sufficient for the paper to be considered a journal version.

(2) Significance of the results: A key aspect of such a study is to measure performance in terms of both read and write (insert/delete) behaviour. However, only the first evaluation considers the write-related performance, while for the remaining two evaluations, it’s unclear whether write-related query workloads are considered, and related discussion is completely missing. Specifically, I have several questions that require clarification with respect to the evaluation:
1. In the first evaluation with BSBM: What’s the rationale behind not including a naive HDT as an additional baseline approach? The evaluation seems to only use a naive RDF4J as a baseline approach.
2. Concerning the second evaluation with WDBench, qEndpoint achieves much better performance than the others; however, the paper lacks a clear description of whether write-related performance metrics were measured. In addition, I’m a bit sceptical about the results, as the comparison is not conducted in an identical environment. Instead, it employs a ‘similar’ setup and draws the results directly from another paper.
3. In both the second evaluation with WDBench and the third evaluation using Wikidata query logs, comparisons are made among qEndpoint, Blazegraph and Virtuoso. It's worth noting that in the second evaluation, qEndpoint outperforms the others significantly, while in the third evaluation, qEndpoint only outperforms on roughly half of the queries. Could you provide an explanation for this difference between the two evaluations?

(3) Quality of writing: There are several instances where the writing lacks clarity and could benefit from improvement:
a. In Table 9, could you elaborate on how the ‘%outperf.’ metric is calculated? Does it correspond to the distribution presented in Fig. 3? Please provide further details regarding the information contained in Table 9.
b. On page 11, you mention, “We report the results in Table 4,” without providing context or specifying which results you are referring to. Specifically, it’s unclear what you mean by “average time”. Does each query run only once, after which you calculate the average time for each query type?
c. On page 14, there is another vague description: “The results are presented in Figure 3 and Table 10.” It is unclear what results are being referred to in this context.
d. On page 9, you refer to finding “the anonymized git repository with the experiments at [21]”, while the reference [21] points to WDBench.
e. In your explanations of some observations, you mention “query optimization problems” (once on page 10, another on page 14). Could you elaborate on what kind of query optimization problem you are referring to behind these explanations?

Data file provided by the authors:
(A) The source code of qEndpoint is provided on GitHub, containing a well-organized README file.
(B) However, the configuration files for evaluations and experiment results are not provided, which makes it impossible to determine whether the experiments can be replicated or not.

Review #3
By Miel Vander Sande submitted on 05/Oct/2023
Suggestion:
Minor Revision
Review Comment:

Synopsis

This paper presents a triple store architecture that allows differential updates: a main data partition optimised for reads is accompanied by a data partition optimised for writes. The write partition contains the delta and is merged into the main partition whenever it becomes too large and inefficient. The authors use the self-indexed, compressed, read-optimised HDT format by Fernández et al. to implement the main partition and the RDF4J native store to implement the write-optimised partition. The system is called qEndpoint.
This is a very interesting, down-to-earth approach to creating an efficient, yet simple triple store. It, among other things, contains a very good but concise explanation of how HDT works. The architecture is sound, but still needs a lot of clarifications. The writing, however, is only passable, and the paper could use a language and structure review for better readability. Therefore, I recommend a minor revision.
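
To make my reading of the architecture concrete: I assume a lookup scans the main partition while skipping triples marked as deleted (the DeleteBit mentioned below), and then appends matches from the delta. A purely illustrative Java sketch (not the authors' code; a read-only list stands in for HDT, a BitSet for the delete bits, and a plain list for the RDF4J-backed delta):

import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Illustration of the read path over the two partitions: main-partition
// triples flagged as deleted are filtered out at query time, so deletions
// never require rewriting the compressed main partition.
final class ReadPathSketch {
    private final List<String> mainPartition;   // read-only, "HDT-like" main partition
    private final BitSet deleteBits;            // one bit per main-partition triple
    private final List<String> deltaPartition;  // triples inserted since the last merge

    ReadPathSketch(List<String> main, BitSet deleteBits, List<String> delta) {
        this.mainPartition = main;
        this.deleteBits = deleteBits;
        this.deltaPartition = delta;
    }

    // A triple-pattern lookup is answered from both partitions.
    Stream<String> select(Predicate<String> pattern) {
        Stream<String> fromMain = IntStream.range(0, mainPartition.size())
                .filter(i -> !deleteBits.get(i))
                .mapToObj(mainPartition::get)
                .filter(pattern);
        Stream<String> fromDelta = deltaPartition.stream().filter(pattern);
        return Stream.concat(fromMain, fromDelta);
    }
}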

Questions and comments
- The introduction talks about graph data, but since this system is RDF-only, it should make this clear from the start. This will also make the paper more focused.
- When reading the paper, many ideas remind me of OSTRICH (https://web.archive.org/web/20190223143009id_/http://pdfs.semanticschola...). While they have different goals, the work on versioned triple stores seems relevant for these types of architectures. Similarly, I expected a reference to X-RDF-3X, the extension to RDF-3X using differential indexes (https://www.researchgate.net/publication/220538435_x-RDF3X_Fast_querying...).
- p3, l41: what additional index is added? Could you briefly clarify?
- p3, l35: the notation used is actually never introduced. For instance, what do I and B mean?
- p3, l48: what are the similarities with the SAP HANA architecture? The authors write that this inspired qEndpoint, but don’t say how.
- The authors almost hand-wavingly refer to Figure 1, but the architecture is not actually properly explained. This makes the remaining explanation of the select, insert and delete operations rather confusing.
* By p5, it’s not entirely clear what types of queries the qEndpoint system supports (or aims to support): triple pattern queries (like HDT) or full SPARQL 1.1? I was thrown off by the sudden mention of FILTER on p5, line 39. This is clarified later on, though.
* Do I understand correctly that triples are stored in RDF4J by their HDT id? How is this done? What modifications did you need to make?
* And what are all these other components, such as the Web/UI and the qEndpoint indexing method?
- The text explaining the SELECT optimisations does not read very well and could use a more structured, to-the-point write-up. Also, how does it avoid selecting deleted triples? I’m guessing it checks the DeleteBit somehow?
- Does HDTDiff compute the new HDT or do you need to run both HDTCat and HDTDiff?
- Could the authors elaborate on how the memory mapping works with the bitmaps when performing the merge? How big can these bitmaps become?
- Does extensive use of HDTCat/Diff make the HDT less efficient, for instance because it gets fragmented? How does that impact performance?
- The Berlin SPARQL benchmark is so old and so selective that the authors probably could have skipped it. Why not WatDiv or the LDBC SPB (https://ldbcouncil.org/benchmarks/spb/) benchmark?
- For better readability: highlight the relevant results in Table 4
- I’m not sure the data loading code is relevant here; better to refer to the git repo with instructions.

language & typos:
- p3, l4: we respectively chose…?
- p4, l28: rephrase the sentence
- p4, l50: rephrase the sentence “…highly compressing RDF…”