A systematic overview of data federation systems

Tracking #: 3074-4288

Authors: 
Zhenzhen Gu
Francesco Corcoglioniti
Davide Lanti
Alessandro Mosca
Guohui Xiao
Jing Xiong
Diego Calvanese

Responsible editor: 
Aidan Hogan

Submission type: 
Survey Article

Abstract: 
Data federation addresses the problem of uniformly accessing multiple, possibly heterogeneous data sources, by mapping them into a unified schema, such as an RDF(S)/OWL ontology or a relational schema, and by supporting the execution of queries, like SPARQL or SQL queries, over that unified schema. Data explosion in volume and variety has made data federation increasingly popular in many application domains. Hence, many data federation systems have been developed in industry and academia, and it has become challenging for users to select suitable systems to achieve their objectives. In order to systematically analyze and compare these systems, we propose an evaluation framework comprising four dimensions: (i) federation capabilities, i.e., query language, data source, and federation technique; (ii) data security, i.e., authentication, authorization, auditing, encryption, and data masking; (iii) interface, i.e., graphical interface, command line interface, and application programming interface; and (iv) development, i.e., main development language, commercial support, open source, and release. Using this framework, we thoroughly studied 48 data federation systems from the Semantic Web and Database communities. This paper shares the results of our investigation and aims to provide reference material and insights for users, developers and researchers selecting or further developing data federation systems.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Andriy Nikolov submitted on 31/Mar/2022
Suggestion:
Minor Revision
Review Comment:

This survey paper reviews existing query federation systems from a practitioner-oriented perspective, namely in terms of supported functional capabilities. The aim was to provide a qualitative assessment of a wide selection of available approaches from several domains over several high-level dimensions that would enable a potential user to select an appropriate system for a given practical use case, rather than to perform a more in-depth, research-oriented quantitative comparison, e.g., one taking query performance into account. The survey takes into account not only Semantic Web-oriented systems, but also relational DBs as well as various NoSQL solutions. As such, it is probably oriented more towards industry use cases than academic research and provides valuable information to new practitioners who want to select an appropriate tool for their tasks.

One aspect that would be nice to clarify more, particularly with respect to Semantic Web-originated systems, is the distinction between two kinds of query federation:
- Transparent federation, where a query is formulated as if all data were stored locally, and it is the task of the federation system to perform automatic source selection, query reformulation, and data transformation behind the scenes.
- Explicit federation, such as the standard SPARQL 1.1 federation using the SERVICE keyword, where the query itself already contains information about the relevant data sources and maps each part of the query to an appropriate source (a minimal query sketch is given below).
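
A sketch of such an explicitly federated SPARQL 1.1 query, with a placeholder endpoint URL and FOAF properties chosen only for this example (not taken from the paper), could look as follows:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name ?email WHERE {
      # evaluated against the local (default) dataset of the receiving engine
      ?person foaf:name ?name .
      # explicitly delegated by the query author to a remote endpoint
      SERVICE <https://example.org/hr/sparql> {
        ?person foaf:mbox ?email .
      }
    }

In the transparent case, the same query would be written without the SERVICE clause, and the federation engine itself would have to decide where each triple pattern is evaluated.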

Table 3 makes it look like only the first option is considered proper federation (as metadata catalogs are not used for explicit federation, while source selection and query partitioning are trivial) and, indeed, transparent federation challenges are usually the main focus of related research in the community. However, in many practical scenarios the second option is sufficient: the queries are often not written manually by the end users, but are hidden behind the user app interface and are constructed by developers who are already aware of the data distribution. Perhaps it is worth expanding a bit on SPARQL 1.1 federation, its capabilities and limitations, and including it in the comparison as well. For example, in Table 2 that would imply that almost any SPARQL-based system would automatically support federation of graph-based data sources (e.g., Virtuoso).

If the goal of the survey is to provide potentially helpful information to new practitioners, then another point that may be worth mentioning in the survey is the possibility of extending the capabilities of one system by combining it with other tools. For example, not all RDF/SPARQL triple stores support virtual graph federation with relational data sources. However, combining a SPARQL graph-based federation with an OBDA solution like Ontop can enable this for almost any SPARQL engine. Similarly, a service federation engine like Ephedra from metaphactory can be deployed on top of any triple store and enable federation over web services (such as REST-based ones). This potential for combining solutions makes the notion of “supporting a data source type” rather less well-defined and can enable many “ticks” in Table 2. Perhaps it might be worth mentioning this in the paper to prevent outright rejection of some options.
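
As an illustrative sketch of such a combination (the Ontop endpoint URL and the vocabulary are hypothetical), any standards-compliant SPARQL 1.1 engine could reach relational data indirectly through an OBDA endpoint:

    PREFIX ex: <http://example.org/schema#>
    SELECT ?product ?price WHERE {
      # RDF data stored in the triple store that receives the query
      ?product a ex:Product .
      # delegated to an OBDA endpoint (e.g., Ontop) that rewrites this
      # pattern into SQL over the underlying relational database
      SERVICE <https://example.org/ontop/sparql> {
        ?product ex:price ?price .
      }
    }

The SPARQL-to-SQL translation happens entirely inside the OBDA endpoint, so the triple store issuing the SERVICE call only needs standard SPARQL 1.1 federation support.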

Regarding the security dimensions (Table 4), it looks like no SPARQL solution supports data masking. I wonder if the “named graph security” implemented in Stardog and applicable also to virtual graphs (https://docs.stardog.com/operating-stardog/security/virtual-graph-security) can be counted as an example of data masking as well.

Another dimension, also potentially relevant when making a decision, concerns any major infrastructural dependencies involved with a particular system: e.g., the Amazon AWS infrastructure for Neptune. This can also be a factor during the system selection process within an organization, so it might be worth adding a column for these dependencies.

Typo: “Costed” vs “cost-based models” (Table 3)

Review #2
Anonymous submitted on 14/Apr/2022
Suggestion:
Major Revision
Review Comment:

The article studies systems able to process queries over a unified schema using multiple (possibly heterogeneous) data sources.
Besides the review of a large number of publications and systems, the paper introduces a framework that allows for the systematic study of systems according to different dimensions that could be of interest to different types of potential users.

The scope of this survey is wider than existing surveys as it covers multiple data models (relational, RDF, others), analyzes the characteristics of a considerable number of systems (both academic and industrial) and considers a fair number of dimensions.
While this wider scope has clear advantages, it also seems to limit the level of detail that can be provided about the different (types of) systems.
One example of this is the "Federation technique sub-dimension". The information provided does not seem suitable as introductory text targeted at researchers, PhD students, or practitioners. Given the in-depth analysis done, some common terminology could have been defined and used to make the survey article more useful and easier to follow.
Another example is that industrial systems are more thorough in describing aspects from the data security, interface, and development dimensions, and it is hardly a novel observation that academic systems, mainly described through their publications, very often lack such information. However, the information provided for most of the sub-dimensions related to these aspects could have been better characterized to help users make an informed choice according to their particular needs.
Trying to cover such dissimilar systems with the same framework might have introduced limitations regarding how well the systems can be presented and how much potential users of data federation systems could gain from reading this survey.

Giving a more detailed characterization of the techniques used for "Query optimization & Query plan generation" could have provided readers with a better overview of the similarities and differences across the studied systems.

Giving a more detailed description of the interface dimension could have better informed potential users of the systems. With the current description and content of Table 5, why would anyone use a system that "does not provide any interfaces"? But are those systems really lacking a "simple solution for automatically invoking the functionalities of a data federation system in other programs or scripts of a larger information system" (the current description of the Command line interface sub-dimension)? Or was another implicit criterion used?

The third observation about "Query language" is not clear enough. Even if some of the systems given as examples there have proposed techniques that mainly optimize BGPs, some of them are able to process SPARQL queries with arbitrary graph patterns (including those with UNION or OPTIONAL).

The choice of dimensions relevant for "end-users, developers, and scholars", based on the personal experience of the authors, could have been validated, and some of those dimensions could have been characterized better so that the gathered information would be more useful to researchers, PhD students, and practitioners.

Doubts that a reader should ideally not be left with after reading the paper:
Could a clearer criterion for including representative/classical academic systems have been given? (was there a number or percentage predefined for these systems? how was that decided? which venues? minimum number of citations? citation by which surveys? why those surveys?)
How was the last "keyword" used to select industrial systems prevented from retrieving too many systems? (each system found could potentially lead to finding more systems)
How are the "Encryption and Data masking sub-dimensions" different?
In Section 5, it is written that "for the academic systems (17 in total), information was mostly extracted and summarized from their academic publications". Did this decision prevent you from finding information related to academic systems in some of the dimensions? Academic publications often focus on contributions in specific areas, e.g., within the "federation technique sub-dimension", and would seldom go into detail regarding other dimensions such as "interfaces" (but such information might be obtained from the official websites of the systems).
Does Virtuoso really provide support for data federation? (in its official documentation this is mentioned as "a planned extension", http://docs.openlinksw.com/virtuoso/virtuosofaq19/)
How were the sources that match more than one of the "6 types of sources" classified?
How were the "examples" given in Table 3 chosen? In the section about Federation techniques, it is mentioned "nearly all of the systems" "most … systems" or "widely used strategies", but relatively few examples are included. Is there any type of bias towards using certain systems as examples?
On page 23, it is mentioned that "an in-depth knowledge of each system and its specific dialect is needed to achieve its full performance". Why was so little of this knowledge shared as part of this paper?
Is the example given about "Interrelationships between data sources" something that could be handled by some of the studied systems?
In what way is Obi-Wan [80, 81] different from other systems such as the ones mentioned in [120-122]? (related to the discussion about "Ontology-based data integration")
How could the systems studied in this survey be classified according to the terminology introduced in [118]?

Minor suggestions:
Check sentence: "if the query language of the data source supporting is different"
Check the classes in example 1: "Reviewer" does not seem to be used
Make clearer what is meant by "refer" in sentences like "the federation engine infers that triple patterns t1, t2, and t3 only refer source S1"
Add the relevant references when using sentences like "Interested readers can refer to the official documents for more details"
Check the text in the first stage in the framework design (Figure 3), it is unlikely that "the aspects interested for" was meant there
Try to keep consistency, e.g., from October until December in the same year there cannot be "more than three months" (sec. 3.1)
Make clearer what is meant by "reduce the technique problems" (page 9)
Make clearer what is meant by "their runtimes" and "influence system requirements" (page 12)

Review #3
By Aidan Hogan submitted on 20/May/2022
Suggestion:
Minor Revision
Review Comment:

# Summary

The paper provides a survey of data federation systems, including systems based on SPARQL, SQL, or indeed other query languages. After a general introduction to the area, and a description of the survey methodology, a total of 48 systems (17 academic, 31 industrial) are included in the survey. The survey itself is structured around four dimensions: (i) federation capability; (ii) data security; (iii) interface; (iv) development. The federation systems are compared, in detail, with respect to these four dimensions.

# Strengths

S1: There are many works dealing with data federation, where this survey distills a wide range of literature into a coherent discussion. I believe that it will serve as a very useful reference for those interested in practical aspects of data federation, particularly newcomers.

S2: I appreciate that the authors have gone beyond a literature survey, and have included industrial systems without a formal publication. Often such systems have more impact in practice, and thus their inclusion helps to paint a more complete picture.

S3: Overall, the paper is well written, well structured, and easy to follow. The dimensions selected make a lot of sense. I believe it should be accessible to (and of interest for) a fairly wide audience, including potential adopters, developers, etc.

# Weaknesses

W1: The survey mainly deals with high-level aspects, not delving too much into technical details of how federation, specifically, is conducted. I think this is a reasonable choice in order to keep discussion accessible and concise (per S3). However, in parts, the paper does refer to specific techniques, such as join algorithms, source selection strategies, etc. (particularly in Table 3). This is welcome, but in order to keep the discussion self-contained, I would like to see the authors ideally provide a brief description of the different algorithms discussed, or at least reference a work that describes these techniques in more detail. In other words, the paper should help a reader unfamiliar with these techniques to learn about them.

W2: The authors apply quite a strict pruning policy for papers to be considered as part of the core survey. Again, while this is somewhat understandable in order to avoid a massive survey, and the criteria largely make sense, it leaves a question mark regarding the "completeness" of the survey. (See also O2 below.)

# Detailed Observations

O1: "As a result, federated query answering via data virtualization reduces the risk of data errors caused by data migration and translation, decreases the costs (e.g., time) of data preparation, and guarantees the freshness of data." While true, the authors paint a one-sided picture here. Federated query answering also adds runtime overhead in terms of combining query answers from multiple independent sources; in this setting, federated queries are much more difficult to optimise. I would suggest to the authors to discuss the potential downsides of federation as well.

O2: "we ignored the ones without free access". I think it's important to be more specific here. Is this "free" access from within a university (in this case, "free access" is inaccurate)? If so, what subscriptions are available? Does it include preprints on homepages? Depending on the precise criteria, the percentage of papers captured could range from a low percentage (gold OA) to a very high percentage (IEEE, ACM, Springer, etc., via subscription, as well as OA, preprints of formal publications, etc). More details are needed. Perhaps also the percentage of papers having free access could be presented to clarify.

O3: Any standards-compliant SPARQL 1.1 engine will support federated querying through the SERVICE keyword, but I notice that such systems are not automatically included in the survey (including BigData/Blazegraph, Jena/Fuseki). Is this due to the fact that these engines did not "turn up" during the literature survey, or are they excluded for another reason? It would be good to clarify.

O4: "vs. specific Web APIs like Facebook one" What about GraphQL? I am slightly surprised (given its popularity) to not see it featured anywhere here. Is it really not considered by any federation system? (Note that GraphQL, despite its name, has little to do with graphs, but indeed would fit under Web APIs.)

O5: "Besides, the well-formalized syntax and semantics of SQL" Hmm. Is the semantics of SQL well-documented? This might require clarification. There have been works in this direction motivated specifically by the fact that SQL's semantics is not very well formalised. See, e.g.:

Paolo Guagliardo, Leonid Libkin:
A Formal Semantics of SQL Queries, Its Validation, and Applications. Proc. VLDB Endow. 11(1): 27-39 (2017)

O6: "Automatically discovering such interrelationships may help developing data federation systems with higher efficiency." Most engines presumably consider joins across sources, so it seems that they do discover such relations across sources (specifically, information overlapping, i.e., information about the same resource). But from the example, I understood that something higher-level is intended here, relating to, e.g., containment relations, etc., between sources. It would be good to clarify this a bit more. In particular, I think the last statement could be more specific: "However, the current methods and systems are usually limited to virtually accumulating all the considered data sources, while ignoring the relationships among them." One could argue that these relationships are considered while computing joins between sources.

# Recommendation

I think the paper can be accepted with minor revisions, where I would ask the authors to:

- Address W1 by providing some obvious way for the uninitiated reader to understand the contents of tables such as Table 3.

- Clarify W2, perhaps by giving the percentage of papers in the full set that failed specific criteria.

- Revise observations O1-O6, and also address the minor comments below.

# Minor Comments:

- "Such [an] interface"
- "called [a] virtual database"
- "the data to [] centralize[d] storage"
- "which respectively provide[]" ... "and provide[] [statistical] information"
- "sources participat[ing]"
- "[that] the data source support[s] is different"
- "as [per] the one in Fig. 2."
- "data federation platform Denodo as an example" ... "Interested readers can refer to the official documents for more details" I would suggest to add a citation or other pointer.
- "phd" -> "PhD"
- "in form of training" -> "in [the] form of training"
- "also to help readers familiarize" -> "also to help familiarize readers"
- "The authors of [6]" Why not "Halevy et al. [6]" It is more readable and more informative. Likewise "The survey in [113]" -> "The survey by Magnani and Montesi [113]", and so forth for similar cases.