Review Comment:
The submission presents an architecture for querying data stored in a Semantic Data Lake, which is similar to virtual ontology-based data access. The second contribution is a description of an implementation (Squerall) and its evaluation on a modified BSBM benchmark (why does the abstract mention 5 popular data sources?). These contributions fit the Knowledge Graphs 2018 special issue call for papers.
First of all, the notion of the Semantic Data Lake is a bit unclear - how is it different from ontology-based data integration (OBDI)? Note that although the focus in OBDI has mostly been on relational data sources, the OBDI framework does not restrict the type of data sources in any way. Also, the novelty of the Semantic Data Lake architecture is unclear - the general architecture is presented in Section 2.2, but the authors claim that they introduced the term in [10]. So, what is new in the submission?
Second, the solution presented in the main Section 3 for "enabling data joinability" can hardly be considered satisfactory: every time a pair of variables is deemed to contain URIs that come from different data sources but refer to the same objects, the user has to add a special TRANSFORM clause to the SPARQL query that describes the modifications of the URIs before they are "matched". This means, in particular, that if the same objects come from, for example, 3 different sources, then the user has a choice of specifying transformations between any two of the three pairs of data sources (that is, s1-s2 & s2-s3, or s1-s3 & s2-s3, or s1-s2 & s1-s3) or transformations between all three pairs (that is, s1-s2 & s2-s3 & s1-s3). In the latter case, the user can only pray that the transformations are compositional. What makes the whole approach truly cumbersome and error-prone is that those transformations would have to be repeated in *each* query to the data sources. Would this "matching" not be better placed at the level of mappings? Something like canonical URIs might also be an option.
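To illustrate the canonical-URI suggestion, here is a minimal sketch. Everything in it is an assumption made for illustration and none of it is taken from the submission or from Squerall: the URI patterns, the source names s1-s3, and the functions `canonical` and `join` are hypothetical. The idea is that each source declares once, at the level of mappings, how its identifiers are turned into a canonical URI, after which any pair of sources can be joined without per-query TRANSFORM clauses.

```python
# Hypothetical illustration: prefixes, source names and key formats are assumptions.
CANONICAL_PREFIX = "http://example.org/product/"


def _s1_to_canonical(key):
    # s1 is assumed to store full URIs such as "http://s1.example/p/42".
    return CANONICAL_PREFIX + key[len("http://s1.example/p/"):]


def _s2_to_canonical(key):
    # s2 is assumed to store prefixed codes such as "P-42".
    return CANONICAL_PREFIX + key[len("P-"):]


def _s3_to_canonical(key):
    # s3 is assumed to store zero-padded numeric strings such as "0042".
    return CANONICAL_PREFIX + str(int(key))


# One normalisation rule per source, declared once alongside the mappings.
TO_CANONICAL = {
    "s1": _s1_to_canonical,
    "s2": _s2_to_canonical,
    "s3": _s3_to_canonical,
}


def canonical(source, key):
    """Return the canonical URI for a source-specific key."""
    return TO_CANONICAL[source](key)


def join(left_src, left_rows, right_src, right_rows):
    """Hash-join two lists of (key, value) rows on their canonical URIs."""
    index = {}
    for key, value in left_rows:
        index.setdefault(canonical(left_src, key), []).append(value)
    result = []
    for key, right_value in right_rows:
        uri = canonical(right_src, key)
        for left_value in index.get(uri, []):
            result.append((uri, left_value, right_value))
    return result
```

With n sources this needs n rules rather than up to n(n-1)/2 pairwise transformations, and the pairwise matches are consistent by construction because every join goes through the same canonical form.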
Third, the submission is quite poorly written, with lots of omissions and hidden assumptions.
\\\\\ DETAILED COMMENTS /////
page 1, line 21: It is not clear what the "opposite" is - did the authors mean rather than "choosing application needs that suit the storage technique"? If so, then it does not sound plausible.
page 1, right column, line 49: The authors mention "local-as-view", but I strongly suspect they actually mean "global-as-view", when the terms of the global schema are defined as views (that is, queries) over local schemas. Would the authors check the definitions?
page 2, left column, line 26: It is not clear who "we" in "we have previously introduced in [10]" are - the list of authors is not exactly the same (not even a subset).
page 2, right column, line 3: The emphasis on "declaratively" is unclear - this aspect has not been properly explained.
page 2, right column, line 16: Is the ontology really a taxonomy? Or is it just a vocabulary? In fact, the ontology is not described in the submission (apart from a short paragraph in Section 5.1).
page 2, right column, line 40: Cassandra appears out of nowhere - does the reader need to know what that is? Is it the most canonical example of a relational DB?
page 2, right column: ParSet is too much of a jargon term and POZ is not properly explained - the meaning of "ParSets live and evolve" is unclear. Also, "data source is any source of data" is obvious and thus redundant.
page 3, left column, line 39: What are "mapping ontologies"?
page 3, left column, lines 46-50: It is not clear why a Query Catalyst would decompose BGPs into stars - what is the purpose? Also, the notion of stars is not defined. And why is it called Query Catalyst? What is so catalytic about decomposing BGPs?
page 4, right column, lines 30-43: The meaning of the paragraph escapes me - the example has different variable names, and the snippet of code should be explained in more detail (what is 12? what is the effect of toInt?).
page 5, left column, lines 37-51: Why not use the same example as in Fig 2?
page 5, Figure 4: The meaning of the diagram is unclear (and the explanation is not satisfactory). What are a and b? Are they values or sets?
page 5, right column, lines 26-29: I do not understand the formula - on the one hand, s_1 and s_2 are parameters of Join and so, one assumes they are given along with pred, for example; on the other hand, s_1 and s_2 also occur under the existential quantifier in the braces. So, what are they? And why are there braces at all? Is the result a set of joins of stars?
page 6, right column, line 18: "interface to the outside" is unclear - what is "outside"?
page 6, right column, line 30: What is the meaning of "We do not intend to generate RDF triples, neither physically nor *virtually*"? Virtually generating means not generating, does it not?
page 7, left column, line 18: "prefix nosql" makes no sense - the user may prefer to use a different shortcut for http://purl.org/db/nosql# (the name of the shortcut is irrelevant)
page 7, left column, lines 49-51: The sentence is too long and complex, and it takes a couple of attempts to link the two sides of "the compromise" (and no : is needed).
page 8, left column, Table 1: Is it really 2.6M persons for 5M tuples? It looks like a 3-fold increase from 26K to 77K, but a 30-fold increase from 77K to 2.6M.
page 8, left column, line 42: Why is ACID important here? Did the authors run updates in parallel?
page 10, left column, line 5: The authors mention ontology-based data access, but I could not find anything related to ontologies in the submission - the only exception is the vocabulary used to describe mappings (which can hardly be called ontology-based data access).
page 10, right column, line 24: The authors claim that the source code of the Optique Platform is not publicly available. However, Ontop, the query transformation system of the Optique Platform, is publicly available (including the source code).
pages 12-14: Appendix A should really be available only online (most of it is standard).
\\\\\ TYPOS /////
page 1, line 26: "solubility of Squerall" reads as though Squerall is a problem and the authors are finding a solution for it
page 1, left column, line 45: "to dis-adhere" is a newly invented word - perhaps, "not to adhere"
page 1, right column: spaces are needed after Hadoop and others
page 2, left column, line 5: replace : by ,
page 2, left column, line 9: it looks like [7] is a reference for the physical data access
page 2, left column: DFS in HDFS stands for the distributed file system, and so there is no need to repeat it - perhaps, "Hadoop distributed file system, HDFS"
page 2, left column, line 22: can not -> cannot
page 2: "join-able" is not a word - why not joinable?
page 2, right column, line 13: middleware (no -)
page 2, right column, line 16: no , is needed
page 2, right column: there is no need for : in the items of the list in Section 2.1 - the bold-faced words are really parts of the sentences
page 3, right column: section 3 -> Section 3
page 4, left column, lines 47-48: strange grammar in "... would yield ..., or yields ... " - why "would"?
page 5: remove ".0" from section numbers
page 5, left column, lines 16-22: check the fonts used in the code samples
page 5, right column, line 37: Algorithm (capital A)
page 6, left column, line 5: remove the space before "line 9"
page 6, right column, line 16: a space is needed after "Spark"
page 7, left column, line 29: ".e.g.," -> "e.g.,"
page 7, right column, lines 18-19: check the grammar - user*s* do*es* ... issue*s*
page 7, right column, line 22: why Spar*K*?
page 8, left column, line 28: Table (capital T)
page 9, left column, line 21: *the* MPP principles
page 9, left column, lines 26-30: it looks like a cut-and-paste gone wrong
page 11, left column: is the URL in [1] official?
pages 11-12: are the URLs in [8], [15], [19], [20] and others useful at all?
pages 11-12: check the name spelling in [12], [15], [27]
page 11, right column: [16] is in capitals for no good reason
page 12: the list of authors in [34] is shortened, yet a similarly long list in [10] is given in full