RDF Graph Partitioning: Techniques and Empirical Evaluation

Tracking #: 2187-3400

Adnan Akhter
Muhammad Saleem
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Guest Editors EKAW 2018

Submission type: 
Full Paper
Abstract:
Over the past few years, RDF data sources have grown enormously, both in number and in volume. As RDF datasets get bigger, the storage capacity of single systems becomes a bottleneck and the need to improve the scalability of RDF storage and querying solutions arises. Partitioning the dataset is one solution to this problem. Various graph partitioning techniques exist; however, it is difficult to choose the most suitable one (in terms of query performance) for a given RDF graph and application. To the best of our knowledge, no detailed empirical evaluation of the performance of these techniques exists. This paper presents an empirical evaluation of RDF graph partitioning techniques using real-world datasets and real-world benchmark queries selected with the FEASIBLE benchmark generation framework. We evaluate the selected RDF graph partitioning techniques in terms of query runtime performance, partitioning time, and partitioning imbalance. In addition, we compare their performance with centralized storage solutions, i.e., no partitioning at all. Our results show that centralized storage of the complete datasets (no partitioning) generally leads to better query runtime performance than partitioning. However, in specific cases partitioning improves performance compared to the centralized solution. Hence, general graph partitioning techniques may not lead to better performance when applied to RDF graphs. Therefore, clustered RDF storage solutions should take the properties of RDF and Linked Data, as well as the expressive features of SPARQL queries, into account when partitioning a given dataset among multiple data nodes.
Solicited Reviews:
Review #1
Anonymous submitted on 31/May/2019
Review Comment:

The paper provides a survey along with an overall evaluation of the techniques for RDF Graph partitioning under the same settings.

The paper has already been accepted at EKAW [1]; I do not see any significant differences in this version except for some explanations of the results, as the authors claim. Even those explanations are more descriptions of the evaluation tables, restating numbers that already exist in the tables. I would be really interested in the interpretation.

The authors mention several times in the paper that general graph partitioning techniques may not lead to better performance when applied to RDF graphs. But the seven partitioning techniques analyzed in the paper are referred to as RDF graph partitioning techniques. There is no clear categorization into general graph partitioning methods and RDF graph partitioning methods. Also, no suitable evidence or results are presented in the paper to support this claim.

In subject- and predicate-based partitioning, the modulo function is not explained. An explanation of the method would improve readability.

In the k-way partitioning technique, the TCV-Min partitioning and the uncoarsening phase are not well explained. The reverse-engineering technique is also not self-explanatory.

It would be nice to have a justification of why 10 partitions are considered in Section 4.1.5.

In Table 1, if only SELECT queries are used, why are ASK, CONSTRUCT and DESCRIBE listed in the table?

In the results section, the interpretation of the results (why and how one method is better than another) is missing.

There are grammatical errors and typos throughout the paper. Some of the major ones are:
- in the first sentence of section 3.1 for horizontal partitioning, it should be ‘partitions of equal size’.
- Also in Section 3.3, the second sentence should refer to ‘predicate modulo’ and not ‘subject modulo’

The format of the references is completely off. It seems to me that the manuscript was submitted in a hurry.

[1] https://link.springer.com/chapter/10.1007%2F978-3-030-03667-6_1

Review #2
By Emanuele Della Valle submitted on 08/Jul/2019
Minor Revision
Review Comment:

# Overall comment

The paper is well written and well structured. To the best of my knowledge, this paper is the first to empirically position the different RDF graph partitioning techniques based on real data and real queries. The authors conducted the study using the appropriate methodology, technical tools, and statistical methods. I find the use of statistical methods particularly strong. The finding is not too interesting since there is no clear winner; however, I found the information contained in the paper useful. Therefore I consider the paper worth publishing.

# Minor comments

## Figures and Tables

I do not find the use of a stacked bar chart on a logarithmic scale in Fig. 6 appropriate, especially because the visualisation tells only part of the story; the other part is in Table 2. I appreciate the attempt to create a dashboard, but the authors may have gone too far in trying to reduce the number of graphs. I recommend expanding the information currently displayed on page 11 over 2 more pages. Each page should concentrate on a single tool (one for FedX, one for SemaGrow and one for Koral) and report, in multiple graphs, the total execution time, the average execution time and the number of timed-out queries per partitioning technique and per dataset.

In Figure 7, the rank cannot be larger than 1.0, so the y-axis should stop at 1.0.

In Figure 9, I'd avoid the use of textures. Shades of grey should be sufficient, given that the legend is in the same order as the bars in the 7 groups.

Table 4 may become unreadable if printed in greyscale. I'd use shades of grey here, too.

In Table 5, the explanations of the *, **, *** are difficult to interpret. I'd use a sentence like "significant at the 1%/5%/10% confidence level", which is easier to interpret.

## Page 13

It is hard to grasp the meaning of the sentence "we want to measure if the runtime performance improvement of on partitioning technique is significant while comparing the runtime performance of another partitioning technique".

## Page 15

The string "(p âL’d’ 0.01)" is not readable.

## bibliography on pages 16 and 17

The bibliography has several serious formatting issues:
- No name is capitalised.
- The titles should capitalise the names of the tools (e.g., FedX and SemaGrow).
- In reference 20, both the year and the editor appear twice.
- Information is missing in references 6, 16, 20, 22 and 31

Review #3
Anonymous submitted on 25/Jul/2019
Review Comment:

This paper is a study of graph partitioning techniques applied to RDF graphs.
There is an introductory part where RDF and graph partitioning techniques are presented.
Then a benchmark is presented where several techniques are compared on three platforms on two RDF datasets.
The idea of the benchmark is to use federated query engines such as FedX to evaluate federated queries on several SPARQL endpoints where the different partitions are spread.
The benchmark produces data presented in tables and figures.

The paper is well written and interesting.
It presents impressively thorough work, but it suffers from several problems.

The main problem is that, at the end, we learn only that some partitionings may be better than others with three specific platforms on two datasets. We do not really get much more.
The authors do not really explain what depends on the federated engine, what depends on the RDF dataset and what depends on the partitioning. In the end, we do not know what to think about it.

The RDF presentation does not take named graphs into account. This may have an impact on graph partitioning.
There is an error in the definitions about RDF.

The tables are not very clear, units are missing, and the log scales are somewhat misleading.

The SPARQL queries in the benchmark do not cover the whole of SPARQL.
Many features are missing: property paths, subqueries, MINUS, named graph patterns, FROM, FROM NAMED, EXISTS, aggregates.

We do not know the size of the datasets, there is no example query, and there is no URL where the queries can be inspected.

A threshold of 180 seconds is set, after which a query is said to time out. Why 180 s?
How much longer than 180 s do these queries need to complete: 181 s? 1000 s? 10000 s?
Do they time out without partitioning?
When a timeout occurs, the authors use the value of 180 s to compute the average time, but maybe the query would take 1 hour?
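To make the bias concrete, here is a tiny numeric illustration (with made-up runtimes, not the paper's data) of how capping timed-out queries at the 180 s threshold can drastically underestimate the true average runtime:

```python
TIMEOUT = 180.0

# Hypothetical true runtimes in seconds; the last query actually needs 3600 s.
true_runtimes = [5.0, 12.0, 40.0, 3600.0]

# What a benchmark records when timed-out queries are capped at 180 s.
recorded = [min(t, TIMEOUT) for t in true_runtimes]

capped_avg = sum(recorded) / len(recorded)          # (5 + 12 + 40 + 180) / 4 = 59.25
true_avg = sum(true_runtimes) / len(true_runtimes)  # (5 + 12 + 40 + 3600) / 4 = 914.25
```

Under these assumed numbers, the capped average is more than an order of magnitude below the true one, which is exactly the concern raised above.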

In the end, the paper is difficult to read and we do not learn that much about the effect of RDF partitioning.
I would suggest a qualitative study to understand why some partitionings would be better than others, and in which cases.


There is a problem with Definition 2:
"E ⊆ {(s, o) | (s, p, o) ∈ G} is the set of edges between the
vertices and l(s, o) = p if (s, p, o) ∈ G is the edge labeling function of G."
In RDF, there may be several edges relating s and o, with different predicates: s p o, s q o, s r o.
Hence the labeling function l(s, o) should return a set of labels: {p, q, r}
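A minimal sketch of the set-valued labeling function proposed here (illustrative only, not the paper's definition): each (subject, object) pair maps to the set of all predicates connecting them, so no edge is lost.

```python
from collections import defaultdict

def edge_labels(G):
    """Set-valued edge labeling: l(s, o) = {p | (s, p, o) in G},
    so multiple predicates between the same pair of nodes are preserved."""
    l = defaultdict(set)
    for s, p, o in G:
        l[(s, o)].add(p)
    return l

# The reviewer's example: three triples relating the same s and o.
G = [("s", "p", "o"), ("s", "q", "o"), ("s", "r", "o")]
labels = edge_labels(G)
# labels[("s", "o")] == {"p", "q", "r"}
```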

Named graphs are missing from the Preliminaries section on RDF.

It is not clear what "Rank score" and "Partitioning Imbalance" mean.
One sentence of explanation for each would be welcome.

"divided into n numbers of partitions of each size"
What does "of each size" mean?

"later two will go"
next ten triples will go
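Under the corrected reading ("partitions of equal size", with the next ten triples going to the next partition), horizontal partitioning could be sketched as a sequential split into n roughly equal chunks. This is an assumed reading for illustration, not the paper's implementation:

```python
def horizontal_partition(triples, n):
    """Split a triple list sequentially into n roughly equal-size chunks:
    the first chunk gets the first ceil(len/n) triples, the next chunk
    the following triples, and so on (assumed reading of the paper)."""
    size = -(-len(triples) // n)  # ceiling division
    return [triples[i * size:(i + 1) * size] for i in range(n)]

# 20 dummy triples split into 2 partitions: the first ten triples go to
# partition 0, the next ten to partition 1.
triples = [(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(20)]
parts = horizontal_partition(triples, 2)
```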

"If there is a significant access of a group of rows together, then Horizontal partitioning may make sense."
"group of rows" is undefined.

3.3. Predicate-Based Partitioning

"The idea is to group all the triples with same predicate and assign them into one partition based on a hash value computed on their **subjects** modulo."
Is it their subjects or their predicates modulo?

3.5. k-way Partitioning
This section is not clear at all; in particular, the coarsening and uncoarsening phases are undefined.

"TCV-Min Partitioning: The total communication volume"
Define in one sentence what communication volume is.

Min-Edgecut Partitioning: explain in what sense the method achieves a *minimum* number of edges connecting different partitions.

Table 1

Results: indicate "number of results".
For BGP, TP, JV, etc., what do the numbers mean and what are their units?
What is the difference between BGP and TP?

Figure 6
Due to the log scale, similar times have different heights.
You compute average times across different datasets (SWDF and DBpedia); what does that mean?

Figure 9
What does it mean exactly to have 1000, 2000, 3000 sources selected?

"Consequently, the general graph partitioning techniques may not lead to better performance when **implied** to RDF graphs."
What do you mean by "general graph partitioning techniques" ?

In the bibliography, restore uppercase for names and acronyms (rdf -> RDF, sparql -> SPARQL).