Review Comment:
# Summary
The paper provides a survey of data federation systems, including systems based on SPARQL, SQL, or indeed other query languages. After a general introduction to the area, and a description of the survey methodology, a total of 48 systems (17 academic, 31 industrial) are included in the survey. The survey itself is structured around four dimensions: (i) federation capability; (ii) data security; (iii) interface; (iv) development. The federation systems are compared, in detail, with respect to these four dimensions.
# Strengths
S1: There are many works dealing with data federation, and this survey distills a wide range of literature into a coherent discussion. I believe that it will serve as a very useful reference for those interested in practical aspects of data federation, particularly newcomers.
S2: I appreciate that the authors have gone beyond a literature survey, and have included industrial systems without a formal publication. Often such systems have more impact in practice, and thus their inclusion helps to paint a more complete picture.
S3: Overall, the paper is well written, well structured, and easy to follow. The dimensions selected make a lot of sense. I believe it should be accessible to (and of interest for) a fairly wide audience, including potential adopters, developers, etc.
# Weaknesses
W1: The survey mainly deals with high-level aspects, not delving too much into technical details of how federation, specifically, is conducted. I think this is a reasonable choice in order to keep discussion accessible and concise (per S3). However, in parts, the paper does refer to specific techniques, such as join algorithms, source selection strategies, etc. (particularly in Table 3). This is welcome, but in order to keep the discussion self-contained, I would like to see the authors ideally provide a brief description of the different algorithms discussed, or at least reference a work that describes these techniques in more detail. In other words, the paper should help a reader unfamiliar with these techniques to learn about them.
W2: The authors apply quite a strict pruning policy for papers to be considered as part of the core survey. Again, while this is somewhat understandable in order to avoid a massive survey, and the criteria largely make sense, it leaves a question mark regarding the "completeness" of the survey. (See also O2 below.)
# Detailed Observations
O1: "As a result, federated query answering via data virtualization reduces the risk of data errors caused by data migration and translation, decreases the costs (e.g., time) of data preparation, and guarantees the freshness of data." While true, the authors paint a one-sided picture here. Federated query answering also adds runtime overhead, since query answers from multiple independent sources must be combined; in this setting, federated queries are much more difficult to optimise. I suggest that the authors discuss the potential downsides of federation as well.
O2: "we ignored the ones without free access". I think it's important to be more specific here. Is this "free" access from within a university (in this case, "free access" is inaccurate)? If so, what subscriptions are available? Does it include preprints on homepages? Depending on the precise criteria, the percentage of papers captured could range from a low percentage (gold OA) to a very high percentage (IEEE, ACM, Springer, etc., via subscription, as well as OA, preprints of formal publications, etc). More details are needed. Perhaps also the percentage of papers having free access could be presented to clarify.
O3: Any standards-compliant SPARQL 1.1 engine will support federated querying through the SERVICE keyword, but I notice that such systems are not automatically included in the survey (including BigData/Blazegraph, Jena/Fuseki). Is this due to the fact that these engines did not "turn up" during the literature survey, or are they excluded for another reason? It would be good to clarify.
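To make the point in O3 concrete, the following is a minimal sketch of the kind of federated query that any SPARQL 1.1-compliant engine should accept. Python is used here only to assemble the query string. The function name `build_federated_query`, the endpoint URL, and the FOAF-based graph pattern are hypothetical and chosen purely for illustration; they do not come from the paper.

```python
def build_federated_query(remote_endpoint: str, limit: int = 10) -> str:
    """Return a SPARQL 1.1 query string that joins a local pattern
    with a pattern delegated to a remote endpoint via SERVICE.

    The endpoint and the FOAF vocabulary are illustrative assumptions.
    """
    return f"""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {{
  ?person a foaf:Person .
  SERVICE <{remote_endpoint}> {{
    ?person foaf:name ?name .
  }}
}}
LIMIT {limit}
""".strip()


# Hypothetical endpoint, used only to show the shape of the query.
query = build_federated_query("http://example.org/sparql")
print(query)
```

In this sketch, the first triple pattern is evaluated against the engine's own dataset, while the pattern inside the SERVICE clause is sent to the remote endpoint; the engine then joins the two on `?person`. No federation-specific machinery beyond the standard keyword is involved, which is why the exclusion of such engines deserves an explicit justification.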
O4: "vs. specific Web APIs like Facebook one" What about GraphQL? I am slightly surprised (given its popularity) to not see it featured anywhere here. Is it really not considered by any federation system? (Note that GraphQL, despite its name, has little to do with graphs, but indeed would fit under Web APIs.)
O5: "Besides, the well-formalized syntax and semantics of SQL" Hmm. Is the semantics of SQL really well formalised? This might require clarification. There have been works in this direction motivated specifically by the fact that SQL's semantics is not very well formalised. See, e.g.:
Paolo Guagliardo, Leonid Libkin:
A Formal Semantics of SQL Queries, Its Validation, and Applications. Proc. VLDB Endow. 11(1): 27-39 (2017)
O6: "Automatically discovering such interrelationships may help developing data federation systems with higher efficiency." Most engines presumably consider joins across sources, so it seems that they do discover such relations across sources (specifically, information overlapping, i.e., information about the same resource). But from the example, I understood that something higher-level is intended here, relating to, e.g., containment relations, etc., between sources. It would be good to clarify this a bit more. In particular, I think the last statement could be more specific: "However, the current methods and systems are usually limited to virtually accumulating all the considered data sources, while ignoring the relationships among them." One could argue that these relationships are considered while computing joins between sources.
# Recommendation
I think the paper can be accepted with minor revisions, where I would ask the authors to:
- Address W1 by providing some obvious way for the uninitiated reader to understand the contents of tables such as Table 3.
- Clarify W2, perhaps by giving the percentage of papers in the full set that failed specific criteria.
- Address observations O1-O6, and also the minor comments below.
# Minor Comments
- "Such [an] interface"
- "called [a] virtual database"
- "the data to [] centralize[d] storage"
- "which respectively provide[]" ... "and provide[] [statistical] information"
- "sources participat[ing]"
- "[that] the data source support[s] is different"
- "as [per] the one in Fig. 2."
- "data federation platform Denodo as an example" ... "Interested readers can refer to the official documents for more details" I would suggest adding a citation or other pointer.
- "phd" -> "PhD"
- "in form of training" -> "in [the] form of training"
- "also to help readers familiarize" -> "also to help familiarize readers"
- "The authors of [6]" Why not "Halevy et al. [6]"? It is more readable and more informative. Likewise "The survey in [113]" -> "The survey by Magnani and Montesi [113]", and so forth for similar cases.