Characteristic Sets Profile Features: Estimation and Application to SPARQL Query Planning

Tracking #: 2903-4117

Lars Heling
Maribel Acosta

Responsible editor: Guest Editors ESWC 2020

Submission type: Full Paper
RDF dataset profiling is the task of extracting a formal representation of a dataset's features. Such features may cover various aspects of the RDF dataset, ranging from information on licensing and provenance to statistical descriptors of the data distribution and its semantics. In this work, we focus on the characteristic sets profile features, which capture both structural and semantic information of an RDF dataset, making them a valuable resource for different downstream applications. While previous research demonstrated the benefits of characteristic sets in centralized and federated query processing, access to these fine-grained statistics is taken for granted. However, especially in federated query processing, computing this profile feature is challenging, as it can be difficult and/or costly to access and process the entire data from all federation members. We address this shortcoming by introducing the concept of a profile feature estimation and proposing a sampling-based approach to generate estimations for the characteristic sets profile feature. In addition, we showcase the applicability of these feature estimations in federated querying by proposing a query planning approach that is specifically designed to leverage these feature estimations. In our first experimental study, we intrinsically evaluate our approach on the representativeness of the feature estimation. The results show that even small samples of just 0.5% of the original graph's entities allow for estimating both structural and statistical properties of the characteristic sets profile features. Our second experimental study extrinsically evaluates the estimations by investigating their applicability in our query planner using the well-known FedBench benchmark. The results of the experiments show that the estimated profile features allow for obtaining efficient query plans.
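
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the usual definition of a characteristic set (the set of predicates that occur with a given subject), represents triples as plain Python tuples, and extrapolates counts from a uniform sample of subjects with a naive scaling factor. The function names, the toy data, and the assumption that the total number of subjects is known are all illustrative.

```python
import random
from collections import defaultdict


def characteristic_sets(triples):
    """Map each characteristic set (frozenset of predicates) to its entity count."""
    predicates_by_subject = defaultdict(set)
    for s, p, _o in triples:
        predicates_by_subject[s].add(p)
    counts = defaultdict(int)
    for preds in predicates_by_subject.values():
        counts[frozenset(preds)] += 1
    return dict(counts)


def estimate_characteristic_sets(triples, fraction, seed=0):
    """Estimate entity counts from a uniform sample of subjects.

    Assumption: counts are extrapolated by the naive factor
    (number of subjects / sample size); this is a placeholder,
    not the estimation method proposed in the paper.
    """
    subjects = sorted({s for s, _p, _o in triples})
    k = max(1, round(len(subjects) * fraction))
    sampled = set(random.Random(seed).sample(subjects, k))
    sample_triples = [t for t in triples if t[0] in sampled]
    scale = len(subjects) / k
    return {cs: n * scale for cs, n in characteristic_sets(sample_triples).items()}


# Hypothetical toy graph: two entities share {name, knows}, one has only {name}.
triples = [
    ("alice", "name", "Alice"), ("alice", "knows", "bob"),
    ("bob", "name", "Bob"), ("bob", "knows", "alice"),
    ("carol", "name", "Carol"),
]
exact = characteristic_sets(triples)
estimate = estimate_characteristic_sets(triples, fraction=1 / 3)
```

With a sampling fraction of 1.0 the estimate coincides with the exact profile; smaller fractions trade accuracy for less data access, which is the trade-off the two experimental studies examine.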

Solicited Reviews:
Review #1
By Ruben Taelman submitted on 01/Oct/2021
Review Comment:

I thank the authors for their updated version, and for replying to and resolving all of the comments I included in the last review round.

The main concern I had has been properly addressed.
The authors included section 3.5 that explains how the sampling approaches can be implemented over SPARQL endpoints and TPF interfaces.
This shows that full access to the entire RDF dataset of all members in a federation is indeed not required, as the authors have claimed.

The other two minor comments I had have also been fully resolved, by adding an appendix and footnote.

I therefore have no further comments on this article, and would be very happy to see it accepted!

Review #2
By Olaf Hartig submitted on 22/Nov/2021
Review Comment:

I thank the authors for responding to all the concerns I had raised in my review for the previous version of the manuscript, and for addressing these concerns in the revised version. I am particularly happy to see the added Appendix A with the CSPF creation algorithm and the evaluation that clearly shows the reduction of time needed to create the CSPF, as well as the added Section 3.5 that outlines concrete ideas for how the authors' approach may be applied in a federation setting. Regarding the latter, I am still skeptical about how well these ideas may perform in practice, but I appreciate the documentation of these ideas in the manuscript and I totally understand that an evaluation of them is out of scope of the work presented here.

With all my concerns addressed, I propose to accept the revised version of this manuscript as is. I congratulate the authors for a well carried out and masterfully presented research work that makes several original contributions.

Review #3
Anonymous submitted on 28/Jan/2022
Review Comment:

Thank you for the revised manuscript, and the detailed response to my points. Section 4 now reads much better, and I am happy with the result. The only thing I would ask is a line stating what getSubjectStars(P) is supposed to do, as it is currently shown in the Algorithm but not in the paper.

Regarding Section 5, I am satisfied with your responses, but you should incorporate them into the paper: please add more detail regarding the goal of the experiments, and regarding the fact that it is not clear whether longer query times may kick in at some point (if they kick in at all).