Review Comment:
SUMMARY:
This paper is part of a larger research effort, in which the authors work on combining Semantic Web (SW) and Business Intelligence (BI) technologies. BI comprises related technologies such as the Extract-Transform-Load (ETL) process, Data Warehouses (DW), and Online Analytical Processing (OLAP).
The work is motivated by the observation that, besides the OLAP and DW technologies, the ETL process itself also needs to be "semantified". This should overcome shortcomings of traditional ETL approaches such as schema dependency, as well as the fact that the ETL process is based on an inherent table-based "canonical" model taken from RDBMSs (which makes sense, since data is often extracted from/loaded into an RDBMS); hence the process falls short in handling graph-based data (such as RDF).
First, the authors outline some general concepts regarding the notion of a semantic DW (SDW) based on the RDF Data Cube (QB) vocabulary (the W3C standard) and the QB4OLAP vocabulary. They extend the QB4OLAP vocabulary by introducing an "UpdateType" for the "LevelAttribute". Then, they present an interesting use case, which has the aim of integrating the "Danish Business dataset" with the "EU subsidy dataset". However, these datasets are rather simple, with a limited number of dimensions; mainly there seem to be a location-based and a time-based dimension with a deeper hierarchy.
More importantly, they describe the integration process based on the ETL process and introduce a two-layered approach with a "Definition Layer" (with source-to-target mappings and target TBox definition tasks) and an "Execution Layer" (with ETL operations and data staging), where the process has four steps: defining the target schema (TBox) using the QB/QB4OLAP vocabulary; extracting source TBoxes; creating mappings between the target schema and the sources with the S2TMAP vocabulary; and executing the process.
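To make the extension concrete for other readers of this review, the following is a minimal sketch of what such an annotated level attribute could look like; the "ex:updateType" property and its value are hypothetical placeholders, not necessarily the authors' exact vocabulary terms.

```python
# Minimal sketch (not the authors' exact terms): a QB4OLAP-style level with an
# attribute annotated by a hypothetical "updateType" property, parsed with rdflib.
from rdflib import Graph

ttl = """
@prefix qb4o: <http://purl.org/qb4olap/cubes#> .
@prefix ex:   <http://example.org/sdw#> .

ex:companyLevel a qb4o:LevelProperty ;
    qb4o:hasAttribute ex:companyName .

ex:companyName a qb4o:LevelAttribute ;
    ex:updateType ex:Type2 .   # hypothetical property/value for the proposed extension
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(g.serialize(format="turtle"))
```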
The second part presents one of the main contributions, the Definition Layer (DL); it is based on the target TBox definition (TargetTBoxDefinition), which is made up of a "dimension schema" and a "cube schema". The former introduces the schema for each dimension, including levels (with their attributes), the relation between levels (defining granularity), and the roll-up relation. The latter introduces the schema for cubes, which includes the bottom-level dimensions, measures, and the aggregate functions applied to the measures. Another element of the DL is the extraction of the source TBox (TBoxExtraction), where the authors recognize that the TBox and ABox are often provided together, but they also suggest a rather simple extraction method. The central task of the DL is the source-to-target mapping (S2TMapping), which is at the heart of an ETL process. The basic idea of S2TMapping is the generation of a workflow of operations, where mappings between a source and a target schema are defined, which then allow a transformation of source instances into target instances. The authors give a formalization of an S2TMapping, which includes, besides the obvious elements (source, target), a mapping relation (super-/subsumption, equivalence, join), filter/join conditions, and an operator. Different mappings can be connected to form sequences of transformations, which seem restricted to acyclic sequences. They further introduce a formalization of the mapping language in OWL, hence introducing a meta-level over the standard TBox's OWL axioms.
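For readability of this review, our reading of the mapping structure can be summarized by the following illustrative sketch; the field and type names are ours, not the paper's, and the enumerations are simplified.

```python
# Illustrative sketch of our reading of an S2TMapping; field names are ours,
# not the paper's, and the enumerations are simplified.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class MapRelation(Enum):
    EQUIVALENCE = "equivalence"
    SUBSUMPTION = "subsumption"
    SUPERSUMPTION = "supersumption"
    JOIN = "join"

@dataclass
class PropertyMapping:
    source_property: str                  # IRI of the source property
    target_property: str                  # IRI of the target property
    expression: Optional[str] = None      # e.g., an arithmetic/aggregate expression

@dataclass
class S2TMapping:
    source_concept: str                   # IRI of the source concept
    target_concept: str                   # IRI of the target concept
    relation: MapRelation
    property_mappings: List[PropertyMapping] = field(default_factory=list)
    filter_condition: Optional[str] = None
    join_condition: Optional[str] = None
    operator: Optional[str] = None        # the ETL operation the mapping parameterizes
    parent: Optional["S2TMapping"] = None # chaining mappings into (acyclic) sequences
```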
The third part presents the second main contribution, the Execution Layer (EL), which consists of ETL operations that take the mappings (or user input) into account for the parameterization of each operation. The extraction operations include a graph and a TBox extractor, where the graph extractor covers the functionality of a SPARQL CONSTRUCT query, which leaves open the question why (restricted) SPARQL CONSTRUCT queries are not allowed directly. The transformation operations include (a) a literal-based and a join-based transformation, (b) level/observation generators and a (tuple-based) data change detector, and (c) a level updater, inference-based materialization, and a URI linker. The literal transformator evaluates property mappings that include arithmetic/aggregate functions. The join transformator evaluates joins that are defined in the join mappings by creating SPARQL queries. Level and observation generators do as expected and are designed to create level members and observations, additionally coining IRIs for new instances. The data change detector acts as an observer of the input ABox and updates level members and observations according to given mappings, where either updating or the creation of new instances can be parameterized. The level updater can be triggered by the change detector and updates levels according to an "Update Type" (T1: overwriting old values; T2: new version of values with validity; T3: creating a new value and annotating the old value). The operation "inference-based materialization" allows new triples to be inferred according to the underlying TBox. However, this operator might create very large ABoxes.
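To illustrate the remark about the graph extractor: the kind of (restricted) CONSTRUCT query we have in mind would look roughly as follows; all IRIs and data here are invented for illustration and are not taken from the paper.

```python
# Rough illustration of the kind of extraction a (restricted) SPARQL CONSTRUCT
# query could express directly; all IRIs and data are invented for illustration.
from rdflib import Graph

source = Graph()
source.parse(data="""
@prefix ex: <http://example.org/source#> .
ex:c1 a ex:Company ; ex:companyName "Acme A/S" ; ex:internalNote "ignore me" .
""", format="turtle")

construct = """
PREFIX ex: <http://example.org/source#>
CONSTRUCT { ?s a ex:Company ; ex:companyName ?name . }
WHERE     { ?s a ex:Company ; ex:companyName ?name . }
"""

extracted = Graph()
for triple in source.query(construct):
    extracted.add(triple)
print(extracted.serialize(format="turtle"))
```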
Fourth, they provide two algorithms that automatically generate ETL data flows. The algorithm "CreateETL" collects all the top-level mappings and acts as an outer loop that calls the "CreateFlow" algorithm for each mapping. CreateFlow recursively finds the operations for a mapping, parameterizes them, and checks sub-mappings that are defined in the source mapping. The algorithms result in a set (or rather a sequence?) of ETL operations.
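The following is our paraphrase of the two algorithms as we understood them; this is not the authors' code, the helper functions are trivial stand-ins, and the exact ordering of the recursion may differ from the paper.

```python
# Rough paraphrase of our understanding of CreateETL/CreateFlow; not the authors'
# code. The helpers below are trivial stand-ins for the real lookups.

def choose_operation(mapping):
    # Stand-in: in the paper, the operation is derived from the mapping.
    return mapping.get("operation", "GraphExtractor")

def parameterize(operation, mapping):
    # Stand-in: in the paper, parameters come from the mapping (or user input).
    return (operation, mapping.get("source"), mapping.get("target"))

def sub_mappings(mapping):
    # Stand-in: mappings defined on the source side of this mapping.
    return mapping.get("sub_mappings", [])

def create_flow(mapping, operations):
    """Recursively resolve one mapping into a sequence of parameterized operations."""
    operations.append(parameterize(choose_operation(mapping), mapping))
    for sub in sub_mappings(mapping):
        create_flow(sub, operations)
    return operations

def create_etl(top_level_mappings):
    """Outer loop: build one operation sequence per top-level mapping."""
    return [create_flow(m, []) for m in top_level_mappings]

# Tiny usage example with made-up mappings:
flows = create_etl([{"operation": "LevelMemberGenerator",
                     "source": "ex:Company", "target": "sdw:CompanyLevel",
                     "sub_mappings": [{"operation": "GraphExtractor",
                                       "source": "ex:RawCompany",
                                       "target": "ex:Company"}]}])
print(flows)
```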
Fifth, the authors describe their implementation, called SETL_Construct, which uses a Jena triple store. SETL_Construct is extended with SETL_Auto for the automatic ETL flow generation. Nicely, they make the implementation available to the community on GitHub. The experiments focus on comparing their approach regarding productivity, development time, and (runtime) performance with their earlier Python-based ETL approach (called SETL_Prog). The evaluation of productivity is based on comparing the number of user interactions, such as clicks or selections, with those required by SETL_Prog. The information gain of this evaluation is rather limited, since comparing a Python-based programming approach with an automated UI-based approach depends on the level of the user's skills and the tools used (for instance, using Eclipse could reduce the number of interactions). The evaluation of development time seems more appropriate for comparing both approaches, since SETL_Construct should give an advantage in creating and improving ETL flows due to the simple reuse of operations, which the authors could indicate in their experiments. The last evaluation reports the performance (regarding runtime) and the (disk) space used in the different steps. SETL_Prog is more efficient regarding runtime and space usage, which comes as no surprise, since ad-hoc implementations give more freedom to optimize the code. We believe, though, that a weakness of the approach is the use of the Jena triple store, which is not known to be the most efficient tool on the market. For the storage of intermediate results, a compressed RDF representation such as HDT could perhaps be used.
Finally, the authors conclude their work with related work (which misses some discussions, see below) and future work. The authors also provide the semantics of their operators, which is based on the work on SPARQL by Perez et al. Future work aims at providing correctness/completeness guarantees for the ETL constructs, as well as addressing the important topic of virtual data integration.
EVALUATION:
The main contributions of this paper are (summarizing the above):
- A general two-layered approach to a semantified ETL process
- A definition layer that includes target TBox definitions and source-to-target mappings, based on a formalization and an OWL-based language to express the mappings
- An execution layer in which different operations such as GraphExtractor, LiteralTransformator, JoinTransformator, and Load can be combined to build a full ETL flow. Furthermore, a mathematical definition and an RDF- and SPARQL-based semantics are provided for each operation.
- An algorithm that automatically generates the ETL operations based on the mappings and (user-given) parameters
- An implementation of their approach, which is compared to a Python-based approach in an extensive evaluation.
The topic of the paper and the main ideas are appealing, and the paper should be published in the SWJ. The paper is, in our opinion, very well written, and the motivating example is a recurrent theme throughout the paper that helps to understand the sometimes lengthy definitions. We believe that, in particular, the separation between mappings and operations, where the latter are automatically generated and parameterized based on the mappings, is a novel idea and an appealing result. The authors also provide a more formal definition of the operators, which is important for a well-defined syntax/semantics. Further, a focus is put on DW-specific operators such as LevelMemberGenerator or UpdateLevel, hence taking requirements such as the "Update Type" into account. The work is, in our opinion, well-rounded and covers all aspects of the approach.
However, we also believe that this paper has certain unclear points, which should be addressed in the answer and (if reasonable) also added to the document (ordered by significance):
1.) A central element of the approach is the source-to-target mappings, which allow different relations (e.g., equivalence, subsumption, join), filter/join conditions, and property mappings. What is the reason for not allowing direct SPARQL queries to the source ABox?
In our opinion this would make the rather cumbersome definition of join relationships obsolete, and it seems that their mappings are compiled into SPARQL anyway (on which the semantics is also based). We agree that the full SPARQL standard might be too extensive; however, a restricted syntax could be used. This would bring the additional advantage that SPARQL operators (such as joins, projections, etc.) could be rearranged to optimize the evaluation, which is not possible in the current approach. A minimal sketch of what we have in mind is given at the end of this point.
Another question regards the mapping relations subsumption and supersumption. Could the authors give an example of the usage of these relations? Since these relations should be defined in the TBox, they should not affect a transformation from a source to a target ABox.
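Sketch for point 1: a join mapping could simply be a (restricted) query over the source ABox, along the following lines; all IRIs and data are invented for illustration and do not come from the paper.

```python
# Sketch of how a join mapping could be expressed directly as a restricted SPARQL
# query over the source ABox; all IRIs and data are invented for illustration.
from rdflib import Graph

source = Graph()
source.parse(data="""
@prefix ex: <http://example.org/source#> .
ex:c1 a ex:Company ; ex:companyName "Acme A/S" .
ex:s1 a ex:Subsidy ; ex:recipient ex:c1 ; ex:amount 10000 .
""", format="turtle")

join_query = """
PREFIX ex: <http://example.org/source#>
SELECT ?company ?name ?amount
WHERE {
  ?company a ex:Company ; ex:companyName ?name .
  ?subsidy a ex:Subsidy ; ex:recipient ?company ;   # the join, stated directly
           ex:amount ?amount .
}
"""

for company, name, amount in source.query(join_query):
    print(company, name, amount)
```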
2.) This work is based on the assumption that all the sources are already ABoxes encoded in RDF. However, a main point of an ETL process is to extract and transform unstructured (or differently structured) data. Hence, the authors should at least add one sub-section that discusses how they would address different data models: if the source data is, for instance, stored in an RDBMS, it seems natural to also allow an R2RML-based operator in the ETL load step (a rough, invented example of such a mapping is sketched below). Would it also be possible to parametrize the R2RML based on the mappings?
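Sketch for point 2: an R2RML-based operator could consume a mapping roughly like the following; the table name, column names, and IRIs are invented for illustration.

```python
# Sketch of the kind of R2RML mapping an RDBMS-extraction operator could consume;
# the table name, column names, and IRIs are invented for illustration.
from rdflib import Graph

r2rml = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/source#> .

<#CompanyMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "COMPANY" ] ;
    rr:subjectMap   [ rr:template "http://example.org/source/company/{ID}" ;
                      rr:class ex:Company ] ;
    rr:predicateObjectMap [ rr:predicate ex:companyName ;
                            rr:objectMap [ rr:column "NAME" ] ] .
"""

g = Graph()
g.parse(data=r2rml, format="turtle")
print(len(g), "mapping triples")
```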
3.) In the evaluation the authors show the applicability of their approach. However, the measurement of productivity based on interactions seems subsumed by the second aspect, development time (which also measures productivity). We rather suggest that the authors provide, instead of (or in addition to) the first aspect, a qualitative evaluation of their work. One possible choice would be guideline-based expert interviews (e.g., as described in "Das Experteninterview" by Bogner et al.) with a small selection of experts or users, which should give insight into the usability of the presented approach.
4.) In the conclusion or related work section, there should be a more in-depth discussion of the advantages of the semantification of "conventional" OLAP versus Relational OLAP (ROLAP) and versus Hybrid OLAP (HOLAP). ROLAP seems orthogonal to their approach, since the results are (always) materialized in the triple store. Each type of OLAP has its advantages/disadvantages, which might (or might not) affect an SW-based approach.
5.) The mapping language itself and its vocabulary are defined in OWL, hence adding some kind of mapping meta-level. Besides convenience, is there another motivation for having this in standard OWL? Rule-based mappings (as used in R2RML, SHACL, or dl-programs) might be easier to understand.
6.) The assignment of operators to the stages of the ETL flow seems a bit blurry. For instance, LevelMemberGenerator, ObsGenerator, and LevelUpdater could be part of the Load phase. Is there an argument why these operators are part of the Transformation phase?
MINOR ISSUES:
P11, left: (D_{name}, L, \rightarrow, H, F_R) should be (D_{name}, (L, \rightarrow), H, F_R)
P15: Listing 3 is quite long and hard to read, maybe a simplified version could be presented
P21, right: Is "MaterializeInference" needed, since the triple store should provide inference out of the box?
P22: The algorithm CreateFlow does not handle cyclic relationships between operators; are there restrictions checked a priori?
P33, left: The syntax/semantics is based on the work of Perez et al.; since this is the foundation of the presented semantics, it could be moved into the preliminaries and introduced in more detail.