Identifying, Querying, and Relating Large Heterogeneous RDF Sources

Andre Valdestilhas
Muhammad Saleem
Edgard Marx
Bernardo Pereira Nunes
Tommaso Soru
Wouter Beek
Claus Stadler
Konrad Höffner
Thomas Riechert

Ruben Verborgh

Full Paper
The Linked Open Data (LOD) principles have been widely adopted due to their undeniable advantages; however, publishing data and connecting it to third-party datasets remains a difficult and time-consuming task. A question often raised during the publication process is whether there is a dataset available on the Web with which we can connect. Although it seems a trivial question, it unfolds into quite complex issues, such as where the related datasets are, how many there are, how similar they are, and how to query them in a heterogeneous environment. This paper tackles these questions by introducing (i) a novel method to detect dataset similarities, including duplicated chunks of data among RDF datasets; (ii) a publicly queryable index called ReLOD, responsible for identifying datasets that share properties and classes; and (iii) a SPARQL query processing engine called wimuQ, able to execute both federated and non-federated SPARQL queries over a large amount of RDF data. To create the ReLOD index and execute SPARQL queries over the Web of Data, we harvested more than 668 k datasets from LOD Stats and LOD Laundromat, along with 559 active SPARQL endpoints, corresponding to 221.7 billion triples or 5 TB of data. We present an evaluation of the accuracy of ReLOD and of the query execution performance of wimuQ over such a massive amount of data.
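
To make the idea of an index over shared properties and classes concrete, the following is a minimal sketch, not the ReLOD implementation. It assumes that each dataset has already been reduced to the set of property and class URIs it uses; the dataset names, the toy input, and the functions `build_index` and `shared_terms` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_index(datasets):
    """Map each property/class URI to the set of datasets that use it."""
    index = defaultdict(set)
    for name, terms in datasets.items():
        for term in terms:
            index[term].add(name)
    return index


def shared_terms(index):
    """For every pair of datasets, collect the URIs they have in common."""
    pairs = defaultdict(set)
    for term, names in index.items():
        # Sorting makes the pair key independent of set iteration order.
        for a, b in combinations(sorted(names), 2):
            pairs[(a, b)].add(term)
    return dict(pairs)


# Hypothetical toy input: dataset name -> property and class URIs it uses.
datasets = {
    "dataset_a": {
        "http://xmlns.com/foaf/0.1/name",
        "http://xmlns.com/foaf/0.1/Person",
    },
    "dataset_b": {
        "http://xmlns.com/foaf/0.1/name",
        "http://purl.org/dc/terms/title",
    },
    "dataset_c": {
        "http://purl.org/dc/terms/title",
    },
}

index = build_index(datasets)
related = shared_terms(index)
```

In this sketch, a lookup in `related` answers whether two datasets share any property or class, and which ones. Pairs with no URI in common simply do not appear, so the structure only grows with actual overlaps.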
