Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.
==================== GENERAL FEEDBACK ============================
This paper introduces a new SparkSQL-based SPARQL query evaluator
called DIAERESIS. This system is based on a novel Data Partitioning
scheme that tries to split the graph using information derived from
the schema graph.
The paper is composed of two main parts: the first explains the
novel partitioning scheme, and the second validates the interest of
the approach by running a benchmark comparison against other
Spark-based SPARQL query evaluators, namely S2RDF, SPARQLGX and WORQ.
I think the approach has great qualities:
- extending the vertical partitioning with a partitioning on nodes is
a good idea, especially for a Spark-based method where this
partitioning serves as a sort of index.
- the approach was actually implemented and the code is available
- the authors ran a benchmark over several datasets with several
competitors, which is a long and challenging task
However, I also see important problems in the current paper:
- many parts of the paper are hard to understand. For instance, I
know RDF and BGPs, but I found your explanation of them a bit
confusing. In particular, while I understand the overall approach
of your partitioning, many of the actual details explained in
section 4 remain unclear to me.
- The benchmark does not fully convince me of the interest of the
approach. The benchmark should be strengthened in the following
directions (see the detailed comments for more information):
- give more details on the setup
- use more/other datasets that actually take the scale of the data into account
- use competitors that are not Spark-based. I fully agree that the
fact that Virtuoso (or any other store) outperforms you in
certain cases does not by itself make your method uninteresting,
but we need to see a comparison with up-to-date and actually used stores.
- explain the situation w.r.t. LUBM and SWDF(u) (see below)
- since your method is based on a parameter k, explain the effect
of k on the performance of your system. At the moment I am
completely unable to say whether your method performs well
because Spark SQL is fast or because the partitioning is working
as intended. Also fix the optimization bias error that, I think,
you make (or explain why it is not an issue).
For all those reasons I hesitated between reject and major revision.
I selected major revision because I think this work has potential in
terms of significance and originality, but please note that this
paper will require a lot of work, both in the writing and in
extending the experimental part, to be accepted.
==================== DETAILED COMMENTS ============================
Note that I made many comments, so there might be additional
issues that I have forgotten to mention here.
--- Comment #1
The first § of the introduction is a bit too hyperbolic. For
instance, the sentence
  To store, manage and query these ever increasing RDF data, many
  distributed big data processing engines have been developed, like
  Hadoop, HBase and Impala
suggests that those tools (Hadoop, etc.) were developed for RDF, which is false.
--- Comment #2
The problem § includes this sentence:
As Spark is focusing on a balanced data distribution, by default, it
partitions triples based on a hashing of the whole triple.
Of course Spark uses hashing for shuffle operations such as groupBy,
reduceByKey, etc., but the storage in HDFS does not use hashing.
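To make the distinction concrete, here is a minimal Scala/Spark sketch
of what I mean; the HDFS path, the key extraction and the number of
partitions are placeholders of mine, not taken from the paper:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    object WhereHashingHappens {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("where-hashing-happens").getOrCreate()
        val sc = spark.sparkContext

        // Reading from HDFS: partitions follow the file blocks/splits,
        // no hashing of the triples is involved at this point.
        val triples = sc.textFile("hdfs:///data/triples.nt")
        println(s"input partitions (block-based): ${triples.getNumPartitions}")

        // Hashing only appears once a shuffle is triggered, e.g. when
        // repartitioning (subject, triple) pairs by key.
        val bySubject = triples
          .map(line => (line.split(" ", 2)(0), line))
          .partitionBy(new HashPartitioner(8))
        println(s"after shuffle: ${bySubject.partitioner}")

        spark.stop()
      }
    }

Reading the file only splits it along HDFS blocks; a hash of the data
only comes into play once a shuffle (groupBy, reduceByKey,
partitionBy, …) is executed.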
--- Comment #3
In the following sentence:
The set C includes all classes and the set P includes all properties, except
rdf:type which connects individuals with the classes they are instantiated under.
What do you mean exactly? That rdf:type is not a member of P?
--- Comment #4
Why introduce complex RDF machinery (such as RDF datasets) when you
seem to ignore it in what follows? It feels like you could just say
something along the lines of "the core component of any SPARQL query
evaluator consists in matching BGPs against labeled graphs". Indeed,
in the SPARQL queries that you support there is no way to actually
navigate the various graphs of an RDF dataset.
--- Comment #5
I understand that you don't want to introduce full SPARQL and that
you don't even want to explain in great detail what a BGP is and what
the semantics of matching is, but the current explanation is
unsatisfactory: I believe it won't help someone who does not already
know SPARQL.
The introduction of SPARQL is not very clear. It does not state that
you limit yourselves to BGP queries (even though SPARQL is bigger than
just BGP queries).
I don't fully agree with "Join operations are encoded in those queries
by sharing the same variable in more than one triple pattern". To me,
each triple pattern defines a set of solutions, and the BGP is simply
a join between the solutions of all the triple patterns composing it.
The part on star and path queries should not be in the same § as the
introduction of what a BGP is.
For this whole section 2.2 an example would be useful.
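For instance, the kind of statement and example I have in mind for
section 2.2 is something along these lines (my formulation, not the
authors'): for a BGP B = {tp_1, ..., tp_n} evaluated over a graph G,

    \[
      \mathrm{eval}(B, G) \;=\;
      \mathrm{eval}(tp_1, G) \bowtie \mathrm{eval}(tp_2, G)
      \bowtie \dots \bowtie \mathrm{eval}(tp_n, G),
    \]

where eval(tp_i, G) is the set of solution mappings of the single
triple pattern tp_i and the join is taken on shared variables. For
example, the BGP {(?x, :knows, ?y), (?y, :name, ?n)} is the join of
the two solution sets on the shared variable ?y.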
--- Comment #6
In 2.3 and later in the related work you write about "Spark’s
default implementation for querying RDF datasets", but Spark does not
have a "default implementation". You can talk about the "naive"
implementation if you want (storing everything in a single file and
filtering/joining over it).
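A minimal sketch of such a naive baseline (paths, parsing and
predicates below are placeholders of mine, not taken from the paper):

    import org.apache.spark.sql.SparkSession

    object NaiveBaseline {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("naive-baseline").getOrCreate()
        import spark.implicits._

        // All triples end up in a single table.
        val triples = spark.read.textFile("hdfs:///data/triples.nt")
          .map { line =>
            // Simplistic N-Triples splitting, for illustration only.
            val Array(s, p, o) = line.stripSuffix(" .").split(" ", 3)
            (s, p, o)
          }
          .toDF("s", "p", "o")
        triples.createOrReplaceTempView("triples")

        // A two-pattern BGP becomes a self-join over that single table.
        spark.sql(
          """SELECT t1.s, t2.o
            |FROM triples t1 JOIN triples t2 ON t1.o = t2.s
            |WHERE t1.p = '<http://example.org/knows>'
            |  AND t2.p = '<http://example.org/name>'""".stripMargin
        ).show()

        spark.stop()
      }
    }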
--- Comment #7
The related work only mentions alternatives based on Spark, but many
SPARQL query evaluators are not based on Spark and some of them do
use a partitioning such as yours. Even if this work is the only
Spark-based query evaluator that does such partitioning, you have to
mention them.
--- Comment #8
I was not able to understand much of sections 4.1 and 4.2. Thankfully
the paper is written in a way that lets me read the rest while
treating the partitioning as a black box that gives me the partitions.
--- Comment #9
Equation (2): what does normal mean exactly? I get that it is
normalized, but how? Is normal(BC(v)) obtained by dividing BC(v) by
n² for n the number of nodes? And what about InstV(v)? Is it divided
by the total number of nodes in the graph?
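For reference, the normalization I would have guessed is the standard
betweenness-centrality one for a graph with n nodes; please state
explicitly whether this (or something else) is what Equation (2) uses:

    \[
      \mathrm{normal}(BC(v)) = \frac{BC(v)}{(n-1)(n-2)}
      \quad\text{(directed)},
      \qquad
      \mathrm{normal}(BC(v)) = \frac{2\,BC(v)}{(n-1)(n-2)}
      \quad\text{(undirected)}.
    \]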
--- Comment #10
Definition 3, as it is written, does not make sense. You take
particular nodes u_i and u_j, but how can the result not depend on
the particular choice of u_i and u_j? Also, what do |u_i| and |u_j|
mean? Should I somehow interpret this as a sum over all possible u_i
and u_j? The notation Distinct(u_j) for the number of distinct pairs
(u_j, u_j) does not make sense to me. Is there a typo, i.e., should
one u_j be a u_i? If so, why does it not appear as a parameter of
Distinct?
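To be explicit about the reading I am guessing at, something along
the following lines would at least be well defined; this is only my
reconstruction and should be confirmed or corrected by the authors:

    \[
      CC\bigl(p(v_k, v_s)\bigr) \;\overset{?}{=}\;
      \frac{\bigl|\{\,(u_i, u_j) \mid (u_i, p, u_j) \in G,\;
            u_i \in \mathit{inst}(v_k),\; u_j \in \mathit{inst}(v_s)\,\}\bigr|}
           {\bigl|\mathit{inst}(v_k)\bigr| \cdot \bigl|\mathit{inst}(v_s)\bigr|},
    \]

where inst(.) denotes the set of instances of a schema node.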
--- Comment #11
As said above, I did not get much of sections 4.1 and 4.2, but the
claimed complexity in 5.1 does not depend on the number of edges in
the graph, whereas CC(p(v_k,v_s)) depends on the number of distinct
(u_i, u_j) pairs. How do you compute this without looking at the
edges? More generally, starting from a bunch of triples, how do you
compute the number of distinct nodes without going through each
triple? Do you assume that the input is already indexed in some way?
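To illustrate the cost I am asking about, here is a self-contained
Spark sketch (path and parsing are placeholders of mine): counting
the distinct nodes already requires going through every triple plus a
shuffle, unless such statistics are precomputed in some index.

    import org.apache.spark.sql.SparkSession

    object DistinctNodeCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("distinct-node-count").getOrCreate()
        import spark.implicits._

        val triples = spark.read.textFile("hdfs:///data/triples.nt")
          .map { line =>
            val Array(s, p, o) = line.stripSuffix(" .").split(" ", 3)
            (s, p, o)
          }
          .toDF("s", "p", "o")

        // Both select() branches read every triple; distinct() shuffles.
        val nodeCount = triples.select("s")
          .union(triples.select("o"))
          .distinct()
          .count()
        println(s"distinct nodes = $nodeCount")

        spark.stop()
      }
    }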
--- Comment #12
I understand that you add the constant (1+|v|) / |v| so that the CC
stays above 0, but why this specific number? Is there some reason to
use it rather than, e.g., 1? Note that on a large graph this number
will actually be very close to 1 in practice…
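To make the last point concrete:

    \[
      \frac{1 + |v|}{|v|} \;=\; 1 + \frac{1}{|v|} \longrightarrow 1
      \quad\text{as } |v| \to \infty,
      \qquad\text{e.g. } \frac{1 + 10^{6}}{10^{6}} = 1.000001,
    \]

so on a large graph the chosen constant is practically
indistinguishable from simply adding 1.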
--- Comment #13
I don't see the point of giving the pseudo-code "Algorithm 1" without
specifying what each function used by Algorithm 1 does:
caclulateImportance (sic), selectTopKNodes, selectPartitionBalanced,
schemaNodesInPathWithMaxDependence, getNeighborsAndProperties,
instances.
--- Comment #14
LUBM is a benchmark with an ontology. If you do not take the ontology
into consideration, many of the given queries return empty sets. How
did you take the ontology into consideration? If you did not, then
discarding the queries that return empty results might widely change
the results.
--- Comment #15
LUBM comes with 14 queries; what are the 13 "provided queries" of
line 21, page 13?
--- Comment #16
SWDF fits within 49.2 MB. I feel it is not very fair to include such a
small dataset in a comparison between Spark-based query evaluators,
because the typical HDFS block size (128 MB by default) is bigger than
the entire dataset… Also, it is unclear which exact queries you used
in the paper. On the GitHub repository
https://github.com/isl/DIAERESIS/blob/master/queries/swdf/ we see a
lot of queries; did you use those?
--- Comment #17
The selected datasets are relatively small for a big data tool
especially considering the potential problems with LUBM (see above).
--- Comment #18
It seems you are setting the parameter k (the number of partitions)
using the benchmark and then publishing results on that same
benchmark. This is an optimization bias. Furthermore, it would be much
more interesting to present a benchmark with different values of k so
that we can see the effect of k.
--- Comment #19
Which exact versions of the competitors did you use? They have been
published for some time, with several versions and several options
available. Which options did you use? For instance, S2RDF and SPARQLGX
can use a statistics table, a prefix compression/expansion scheme,
Hadoop compression, or an upper bound for the ExtVP tables. Did you
activate some of these options or not? If not, how is it possible that
DBpedia is stored in 3 GB with SPARQLGX?
--- Comment #20
You use 4 physical machines, but you only compare Spark-based query
evaluators, and 4 machines is a bit small for large datasets…
Furthermore, you say that they have 400 GB of storage, but what kind
of storage? For an experiment over large datasets it is really
important to know this, as data reads and writes are generally the
bottleneck. If each machine has a single hard drive, the 38 cores are
pretty much useless for most queries…
--- Comment #21
There seems to be very little change in processing time for SPARQLGX
between the various LUBM datasets (except for the biggest one; even
the path queries take less time than on LUBM2300). How is this the
case? Is there some "booting" time?
--- Comment #22
Fig 9 seems a bit useless to me.
--- Comment #23
For the Data Access Reduction I would like to see some sort of explanation.
--- Comment #24
You say that your competitors do not support the queries in SWDF(u),
so why include SWDF(u)? You can explain that you support features
that they don't, but it seems a bit weird to include this as a
benchmark with zero points of comparison. Also, both of them have a
way of supporting these queries by bypassing the (Ext-)VP tables…