A systematic overview of data federation systems

Tracking #: 3201-4415

Zhenzhen Gu
Francesco Corcoglioniti
Davide Lanti
Alessandro Mosca
Guohui Xiao
Jing Xiong
Diego Calvanese

Responsible editor: 
Aidan Hogan

Submission type: 
Survey Article
Data federation addresses the problem of uniformly accessing multiple, possibly heterogeneous data sources, by mapping them into a unified schema, such as an RDF(S)/OWL ontology or a relational schema, and by supporting the execution of queries, like SPARQL or SQL queries, over that unified schema. Data explosion in volume and variety has made data federation increasingly popular in many application domains. Hence, many data federation systems have been developed in industry and academia, and it has become challenging for users to select suitable systems to achieve their objectives. In order to systematically analyze and compare these systems, we propose an evaluation framework comprising four dimensions: (i) federation capabilities, i.e., query language, data source, and federation techniques; (ii) data security, i.e., authentication, authorization, auditing, encryption, and data masking; (iii) interface, i.e., graphical interface, command line interface, and application programming interface; and (iv) development, i.e., main development language, deployment, commercial support, open source, and release. Using this framework, we thoroughly studied 51 data federation systems from the Semantic Web and Database communities. This paper shares the results of our investigation and aims to provide reference material and insights for users, developers and researchers selecting or further developing data federation systems.

Solicited Reviews:
Review #1
Anonymous submitted on 09/Aug/2022
Review Comment:

The article studies systems able to process queries over a unified schema using multiple (possibly heterogeneous) data sources.
Besides the review of a large number of publications and systems, the paper introduces a framework that allows for the systematic study of systems according to different dimensions that could be of interest for different types of potential users.

The scope of this survey is wider than existing surveys as it covers multiple data models (relational, RDF, others), analyzes the characteristics of a considerable number of systems (both academic and industrial) and considers a fair number of dimensions.

The comments and doubts that I expressed in my previous review were considered in this new version. The authors made considerable changes to the paper to include more details for some of the dimensions, be more precise describing the methodology that they followed, and make the text of the paper easier to follow for readers with limited background in the area.

I have only a few doubts regarding the current version of the manuscript:

It is unclear what "tuned for recall over precision" means under "The selection of industrial systems" (page 9). How could this "tuning" be understood (or reproduced) by others? How much higher is the weight of recall?

Systems such as CostFed, Odyssey, SemaGrow, and SPLENDID have been evaluated using LargeRDFBench (e.g., in [41], mentioned on page 29). Since this benchmark includes queries with UNIONs and OPTIONALs, and these systems are able to process the queries without using "another, fully-fledged, SPARQL federation engine", what is stated in the third observation about "Query language" (page 17) does not seem accurate.

Under "Source selection and query partition", the sentence "Other systems, like HiBISCuS, propose a refinement of the query-based strategy where the probing query has a complex structure based on the hypergraph underlying the input SPARQL query." seems inaccurate. In reference [66], Algorithm 1, line 17, only one subject, one predicate, and one object seem to be used to perform "the probing query", but the sentence seems to suggest that [66] uses probing queries with a complex structure.

Why is reference [138] included in Table 9? This reference presents a data federation system and does not seem to focus on evaluation of systems more than the other systems included in Table 10.

Possible typos:
"consists in" (second paragraph, page 2)
"this sub-dimensions" (point 1, page 25)

Review #2
By Aidan Hogan submitted on 26/Aug/2022
Review Comment:

The authors have provided a very detailed and well-written response to my previous comments. I particularly appreciate the clarifications regarding selection criteria, as well as the provision of a concise but very useful summary of federation techniques. Granted, the paper still has some limitations (the clarification regarding selection criteria now highlights that – and reveals why – key papers like ANAPSID and QTree are excluded). But in general I think this is a very useful survey that will serve the community well, and is almost ready for publication, pending a revision of some superficial issues:

- "According to the prediction in [1]," As mentioned before, why not "According to Reinsel et al. [1]"? It is a matter of style, and thus the authors' choice, but I think the latter is more readable and more informative.
- "Transparent federation gives []users"
- "A simplified setting is []one where"
- "Also the transparent setting, though, is not devoid of drawbacks" -> "However, the transparent setting is not devoid of drawbacks"
- "This survey work [stems] from our needs [for]" or "This survey work was spurred on by our needs [for]" (the former is better)
- "This sub-dimension also permits [us] to distinguish"
- "affect the system['s] fitness for use"
- "considered as [per] their latest version" (As a minor quibble, this might be in contradiction with basing details on academic publications; often publications will emerge infrequently, while the underlying system might have more recent updates.)
- Very minor, but I would reduce the gray levels a little in Table 1, and similar tables, making the shading lighter.
- "the support [of] additional" or "[extending] the support to additional"
- "are relatively less considered" -> "are considered less often" (less is already relative)
- The switching of colours for headers in Table 3 is a bit distracting. I would suggest keeping all headers in the darker gray.
- "A fundamental classification of federation techniques for this component is between the ones where the metadata catalog ..." I cannot parse this part of the sentence.
- "as [per] the traditional setting"
- "and Stardog;´" There is some weird apostrophe here.
- "Obi-Wan is for the most part a proof-of-concept of a vaster and insightful theoretical exercise" This sentence reads a little strangely to me. I would suggest revising it. Maybe "of a more general theoretical exercise".
- "Hassan at al[.]"
- "only itself as [a] supported data source"
- "These venues turn out to include" -> "The resulting venues include"
- Figure 6: the 3D pie chart is best avoided (https://jscharting.com/blog/pie-charts-in-modern-data-visualizations/). A humble 2D pie chart would be much clearer.

Orthogonal to the paper, but regarding developers and users as a target audience, maybe you could publish something like a blog post or a GitHub repository that presents a summary of the information, perhaps with some of the tables. This is just a suggestion, of course, and it might be a lot of work, but such readers might not be inclined to dive into a 60-page academic paper.