Identifying, Querying, and Relating Large Heterogeneous RDF Sources

Tracking #: 3030-4244

Andre Valdestilhas
Muhammad Saleem
Edgard Marx
Bernardo Pereira Nunes
Tommaso Soru
Wouter Beek
Claus Stadler
Konrad Höffner
Thomas Riechert

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper

The Linked Open Data (LOD) principles have been widely adopted due to their undeniable advantages; however, publishing and connecting data to third parties remains a difficult and time-consuming task. A question often raised during the publication process is whether there is a dataset available on the Web with which we can connect. Although it seems a trivial question, it unfolds into quite complex issues, such as where the related datasets are, how many there are, how similar they are, and how to query these datasets in a heterogeneous environment. This paper tackles the aforementioned questions by introducing (i) a novel method to detect dataset similarities, including duplicated chunks of data among RDF datasets; (ii) a publicly queryable index called ReLOD, responsible for identifying datasets sharing properties and classes; and (iii) a SPARQL query processing engine called wimuQ, able to execute both federated and non-federated SPARQL queries over a large amount of RDF data. To create the ReLOD index and execute SPARQL queries over the Web of Data, we harvested more than 668 k datasets from LOD Stats and LOD Laundromat, along with 559 active SPARQL endpoints, corresponding to 221.7 billion triples or 5 TB of data. We present an evaluation of the accuracy of ReLOD and of the query execution performance of wimuQ over such a massive amount of data.
Decision:
Reject (Two Strikes)

Solicited Reviews:
Review #1
Anonymous submitted on 11/May/2022
Minor Revision
Review Comment:

As the initial version was on my table long ago, I read the paper as a new one, and only afterwards looked through my first review - I had many, many more complaints back then :) The second version of the paper reads well, is clear, and doesn't contain any redundant parts. I summarised the doubts I have about the quality of the paper in the comments below. They are mostly related to the mismatch between the timeline and the presented results: 2 years passed, the related work looks somewhat outdated, the future-work wish-list is still long, and some points about the tool were only cosmetically addressed. The cover letter was a bit misleading - it frequently refers to sections that don't exist in the latest version.

* State of the art: in the whole reference list there are two references from 2020, and most of the others are older. "is still an issue" - according to a 2011 reference. "more recent studies" - see a 2016 reference. These are just some examples. Is it really the case that almost nothing related has been happening around this topic in, say, the last 5 years?

* Motivation. Talking about the similarity of the datasets, is there a real problem to be solved? If subject URIs are the same, I likely already know the other dataset exists, whereas the intersection of property URIs alone is not necessarily an indicator of dataset similarity. Real (or realistic) use cases, e.g. in Section 4.1, would help strengthen the motivation.

* In the evaluation section, the quality and completeness of the results are assessed based on the number of results. This is a big assumption: why are you sure that all the results are relevant?

* The cover letter is NOT very helpful. On many occasions the authors refer to the sections that are not in the paper! Examples:
** "Three more practical use cases were added in Section 1.2" - there's no such section
** "We got the gold standard from []" - empty reference, so not a very helpful response.
** "We also include new content in section 1.3 (“Compliance with Linked Data rules and LOD Cloud”)"" - no such section
** "The issue was addressed in the paper in Section 3.5" - no such section

* The number of "we will" statements in the conclusion is a bit too high for (a) a journal paper that (b) took 2 years to revise.


Typos and smaller comments:

* p.3 l.36: They --> Their
* p.7 l.32 We each --> We search?
* section 5, intro - a mess, most of the paragraphs are repeated twice
* section 5, the logic of using bold fonts for big chunks of text is not clear
* section 5.1.5 - the meaning of the last sentence is unclear, probably because of typos or missing words
* p.10 l.37 an higher --> a higher
* Figure 5: what do you mean by the number of bindings?
* p.11 l.43-47 - the sentence starting with "At present" is totally unclear
* p.12 l.39 - "A and B" are what? triples, URIs, else?
* p.14 l.39 1 million or 1 billion triples?..
* p.15 l.34 datasets share related --> datasets share is related
* p.15 the description of the "accuracy of our matching algorithm" deserves more details, as it is highly relevant
* p.16 l.45 - please proof-read the part starting from "On the Section 5.1.6"

Review #2
Anonymous submitted on 19/Aug/2022
Major Revision
Review Comment:

This review is a discussion of the differences between the first submission of the paper and the current version, and an argument as to whether or not they address the concerns raised by the reviewers in the first round.

In summary, whereas there are improvements, this paper does not seem ready for publication - and it should be at this stage of the process. Some of the added text does not seem to serve the SWJ audience, but rather to ease some of the reviewers' concerns; even at that, it fails.

It is clear that the paper has improved since the first iteration. However, it's important to note that the first version was suffering from a couple of serious issues that severely hindered readability, so the relative improvement as such is not the only indicator to be taken into consideration.

Some of the added paragraphs are indeed new text, but it can be questioned to what extent they add value to the text.

In particular, the added text on page 9 has multiple issues; it reads more as a defense against certain reviewer comments than as added value for the reader. Generic statements on higher recall and runtime are not backed up by evidence; furthermore, that paragraph is repeated twice. The "another question" is in essence the age-old problem of federated querying, which comes down to source selection. All in all, this reads like a hastily added block of text that might diminish the value of the manuscript rather than improve it.

Whereas most of the language issues of the original seem to be addressed, added texts also introduce numerous new errors. That by itself is not a blocking problem, but might indicate an insufficient quality barrier. At this stage, the manuscript is expected to be in near-final state, and language is just one of the many aspects.

Several pieces of evidence seem to be more anecdotal rather than structural, and as such the conclusions should be more nuanced than they are. For example:

> wimuQ+ReLOD is able to retrieve at least one resultset for 87 % of the overall 415 queries, which 11% more results thanks to the ReLOD approach. The results clearly shows [sic] that combining different query processing engines into a single SPARQL query execution framework lead [sic] towards more complete resultset retrieval.

I'm quite wary of self-fulfilling prophecies and the lack of repeatability in the presence of statements such as "usability study, where we conduct the study with seven PhD students from our research group"; it's a piece of questionable qualitative evidence in an otherwise quantitatively driven study. The authors seem a bit lost here as to what they want to prove and how. I'm also not sure what to make of statements such as "the resulting scores from the usability study was better than we had expected", and what they are supposed to mean for the reader.

The "promises" in the Conclusion section are quite out of place. Future work is intended to explain how other researchers can build on top of your work, but here it reads as a list of shortcomings that the authors aim to address, which is not helpful to the reader. "We will make a better assessment" is especially unacceptable in this regard; this assessment should have been in this journal paper.