Review Comment:
The paper "Semantic Integration of Multidimensional Statistical Data: The CubeModeler Framework" addresses the question of how to more easily integrate heterogeneous statistical data, such as from the World Wide Web, e.g. in CSV format.
Its goal is to present a modelling tool and accompanying methodology for statistical data and to evaluate its broad applicability.
Its approaches are:
* the reuse and elaborate application of a standardised vocabulary for statistical data on the semantic web, the RDF Data Cube Vocabulary (QB).
* an implemented tool for modelling and integrating statistical data using the QB vocabulary when storing the RDF in a database (RDF triple store, with SPARQL support).
* a methodology of how to use the modelling tool
* two examples of using the tool in a use case (in sports analytics, and in environmental monitoring)
* an evaluation with two other approaches w.r.t. modelling effort (in clicks) and performance (in time for preprocessing/loading)
The conclusions of the paper are that its approaches - despite some limitations and several interesting open issues - are broadly applicable, easy to use and allow efficient preprocessing/loading.
There are several things I like about the paper:
The central question asked by the paper is both interesting and important. At a time when users pose questions more and more to chatbots and those chatbots need to find relevant information via search engines, databases and other tools, it is all the more relevant to provide harmonised access to as many statistical datasets as possible.
I personally like the RDF Data Cube vocabulary and find it useful and (all the more) promising to bridge the gap between (mostly internally built and used) data warehouses / data analytics tools and the (mostly used for communicating with external parties) semantic web. Therefore, I appreciate research and applications around it.
Also, the appropriate methods are used to address the question, i.e., a method to reuse RDF Data Cube vocabulary features such as declarative statements between dimension properties in hierarchies, an implementation and methodology, two practical applications in use cases, and a quantitative evaluation.
Also, the data from the evaluation does support the conclusion of a potentially useful and widely applicable tool.
Yet, I found the following major issues with the paper:
1) The paper lacks significance of results with respect to the approach. The superiority of the introduced method is not clearly explained. For instance:
* The paper stems from expertise in ontology mapping and knowledge graph construction; I like this viewpoint and am convinced it brings in some novelty. Still, in the current state of the work, this contribution is not made clear enough. For instance, SDM-RDFizer and Morph-KGC are cited as having duplicate-aware operators; however, the connection to statistical linked data seems rather remote, and no example or explanation from the current work is given.
* One method is the use of rdfs:subPropertyOf between different dimension properties as a good balance between rigid and flexible modelling. Although this is an interesting idea, I do not see its benefit. If all datasets are modelled while loading them into the system, one can directly try to use the same modelling as the basis for integration. Similarly, yes, it is easier to match properties via rdfs:subPropertyOf than via owl:equivalentProperty or the same URIs since it is not as rigid, but the respective modelling effort is still needed. The same goes for code lists and slices/dices. Yes, you can relate different data cubes more loosely. But in the end, it will be more difficult to query those data cubes together.
* I was disappointed to read "Further data modeling challenges regarding data cubes exceed the scope of the current paper". How many more are there? Is this just a tiny bit of possible heterogeneities? From a journal paper, I expect some kind of "completeness" for some view on the problem/question at hand. I would at least list (and give names to) different challenges.
* The SPARQL queries are difficult to understand. For instance, for Listing 1, the description says "This query combines two types of cubes (games, player statistics) for different leagues for a specific player."; however, in the SPARQL query, I only find one data cube, with ?statline as its observations, and no hint that several cubes (qb:DataSet or qb:DataStructureDefinition resources) are actually queried. Maybe I am missing some greater contribution regarding querying the "unified view", but at the moment I mainly see the rather large effort in both modelling and querying. The use of "BIND" constructs in the query adds further complexity and is not explained. This does not fulfil the claim of "showing how complex yet intuitive querying across semantically aligned datasets can be performed".
* Yes, queries could be filled in via a form with a simple name input. Still, arriving at the query does not seem intuitive or easy to me (even as a person who has worked with this kind of data a lot).
* "traditional integration would require flattening these datasets into a single schema or building complex mappings between each structure, something that can be fragile and time-consuming." Even though this statement is correct, it does not say much about the paper's work. The work lies between those two extremes. How the approach of the paper finds a better balance between them remains unclear.
* The term "component" is not used consistently throughout the paper which makes understanding difficult. From its first mention in the abstract "The method follows a clear sequence of modeling steps — including DSD construction, component and codelist definition, dataset description, semantic transformation and SPARQL querying.", over its mention in the introduction "This is also the case for statistical data, which often rely on metadata and contextual dimensions such as time, location, or measurement unit. These are core components of their structure, essential for interpreting and comparing values in semantic data integration." to its mention in "creating or reusing components and codelists".
* The limitations chapter discusses possible open work such as query optimization. The possibility of "alternative integration approaches outperforming CubeModeler" is described very nicely but too briefly. Since there is a separate section on "Future Work", I would recommend separating those topics more clearly: a "discussion" chapter on the interpretation of the results with respect to the actual goals of the paper (such as w.r.t. alternative approaches), and an "open work/future work" section describing possible work beyond the scope of the paper.
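To illustrate the concern about rdfs:subPropertyOf raised in the list above, here is a minimal sketch in plain Python. It is not taken from the paper or from CubeModeler, and all property and observation names (ex:season, ex:leagueSeason, ex:statSeason, obs:...) are hypothetical. It assumes two cubes whose season dimensions are related only via rdfs:subPropertyOf to a shared property, and shows that a query on the shared property finds nothing unless the sub-property entailment is applied.

```python
# Hypothetical toy data: two cubes use different dimension properties,
# both declared as rdfs:subPropertyOf one shared property.
SUB_PROPERTY_OF = {
    "ex:leagueSeason": "ex:season",
    "ex:statSeason": "ex:season",
}

# (observation, property, value) triples as they might be stored.
TRIPLES = [
    ("obs:game1", "ex:leagueSeason", "2023"),
    ("obs:stat1", "ex:statSeason", "2023"),
    ("obs:stat2", "ex:statSeason", "2022"),
]


def super_properties(prop):
    """Return prop followed by every property reachable via subPropertyOf."""
    seen = [prop]
    while prop in SUB_PROPERTY_OF:
        prop = SUB_PROPERTY_OF[prop]
        if prop in seen:  # guard against cyclic declarations
            break
        seen.append(prop)
    return seen


def match(prop, value, infer):
    """Observations having the given property value, with or without inference."""
    hits = []
    for obs, p, v in TRIPLES:
        candidates = super_properties(p) if infer else [p]
        if prop in candidates and v == value:
            hits.append(obs)
    return hits


plain = match("ex:season", "2023", infer=False)
inferred = match("ex:season", "2023", infer=True)
print(plain)     # []
print(inferred)  # ['obs:game1', 'obs:stat1']
```

So the looser alignment is not free: either the store has to apply the sub-property entailment, or every query has to enumerate the sub-properties itself, which is the querying effort I am referring to.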
2) The paper lacks significance of results with respect to the evaluation. The quality metrics are poorly chosen, and the evaluation is not reproducible. For instance:
* In the evaluation, the usage of Karma and RMLMapper is not explained sufficiently for understandability and reproducibility. Both tools/methods could have been explained and related to the current work in more detail in the related work section. Instead, they are introduced very superficially there. At the least, this could have been added to the appendix.
* I appreciate the elaborate evaluation method used for comparing workflow simplicity, model reusability and semantic flexibility between CubeModeler, Karma and RMLMapper. However, the times for different tasks seem rather arbitrary to me. Similarly, the Reuse Ratio does not seem helpful to me; how come reuse differs between CubeModeler, Karma, and RMLMapper if they have the same expressivity? Also, Precision/Recall does not seem to me a good metric for integration scenarios. At least, I have not seen it used for this purpose so far. In general, it would have helped the evaluation if (peer-reviewed) papers that use similar evaluation methods had been cited.
* In the evaluation, do the scenarios increase in complexity? If so, it would be good if a column in Table 1 showed that. Only then does it make sense to speak of a trend over complexity w.r.t. the reuse ratio between different approaches in Figure 12.
* It is good to have performance evaluations included in the paper. However, in my opinion, the performance of integration/loading is much less interesting than the performance of querying the data. For one thing, preprocessing and transformation performance may say more about the focus/willingness of the respective developers (of each tool) regarding proper software engineering, and less about the appropriateness of the method/approach. For another, in applications, the focus is much more on query performance than on ETL performance. By the way, the loading of different sizes of data could have been shown as a line chart to show some trend, e.g., of linear or exponential growth. ETL and query performance may be related since more complex integration/loading may lead to more (or less) complex querying. However, the paper does not discuss this. The paper discusses only some preliminary query performance tests.
* The supplementary material (SWJ_CubeModeler.zip) includes only the LaTeX source of the paper. The actual source code is only available on request and may be published in Q1 2026. I think this should be clarified before publication.
3) And the quality of writing/presentation can be improved, most importantly with a common thread throughout the paper.
* Introduction and Related Work build a lot of suspense, e.g., "increasing volume and variety of statistical datasets published in decentralized ways highlights the need for a consistent and streamlined data integration process", or "alternative modeling-based approach to data integration is explored, shifting the integration process to the modeling stage". I recommend being clearer about the contribution of the paper.
* The chapter "Dynamic evolution" mixes results and discussion of results, which I find confusing. By which number is the statement "near minimal-effort scalability" confirmed? This may also be improved by more descriptive chapter titles, e.g., instead of "Dynamic evolution" maybe "Applicability of CubeModeler with increasing number of datasets". Similarly, having one chapter called "Modular Scalability and Integration" and one "Performance Overview" shows potential for more descriptive naming with a common thread throughout the paper.
* In general, I think the evaluation (effort in modelling, expressivity in modelling, performance of translation) could have been aligned more closely with the research question / problem in the introduction. For me, the evaluation criteria came rather "out of the blue" and were not systematically led up to. This is also related to my comment on "building suspense" in the introduction without fulfilling it in later chapters; instead I would have let the readers know earlier and more clearly what will be done in the paper.
* For instance, it may help to define a "persona" in the introduction or use cases who has the investigated problem of integrating various statistical datasets (e.g., for themselves or their group/department). This person may or may not also be the user querying the data.
* Figure 1 mixes data flow and knowledge graph which is difficult to understand. Do the colors carry any specific meaning? If so, please explain, if not, consider leaving out the colors.
* Figure 2 again mixes different meanings of arrows. It is not clear what the visualisation should explain. Maybe add a descriptive text to the figure to make its message clear; otherwise consider leaving it out.
In summary, with some novelty about solving the problem of statistical data integration, the paper shows potential for being published at SWJ.
Yet, given the current significance of results and quality of presentation, I can only recommend a "major revision" of the paper.
Ideally, the paper contains:
* A well defined set of possible heterogeneities between tabular datasets with statistics.
* A well defined set of methods to deal with those heterogeneities with as little effort as possible, preferably by adding declarative knowledge and by having a clean translation process from tables to RDF data cubes.
* A well defined set of queries with placeholders as a blueprint for querying the unified view of data cubes.
* An application of the approach to one or more clear use cases, from the administrator's view managing the unified view to the end user having information needs fulfilled.
* An evaluation based on an (open source) implementation applied to the two use cases, with all resources made openly available for reproducibility.
For acceptance, much less will be required, but - in my opinion - the paper should clearly go into that direction.
Minor comments:
* I like how the paper introduces slices as a means for better querying.
* Figures 3 and 4 are nice, and I see some meaning in the coloring. Figure 5 is nice, too.
* What if datasets are modelled in QB vocabulary already? Can we simply import them?
* I found a few spelling errors, e.g. "cases are prsented".
* Some sentences / paragraphs are too informal for my taste, e.g., "alignment isn’t obvious, CubeModeler offers a “smart slice mapping” feature that automatically detects shared values across slice-level components to generate grouped slices. The algorithm ensures each slice includes all observations with matching values in the relevant columns" or "if a semantic transformation task for the same DSD has to take place later with another dataset, it can leverage the respective configuration workflow and automatically all settings will be imported and adjusted directly to the mapping interface. In this way the user can simply click the execute button to proceed to the RDF transformation and have the final RDF file in a blink of an eye." or "Finally, the dataset is totally transformed to RDF in TTL format".
* With respect to cited papers: Several references are missing their publication year. Many cited papers are rather old. No previous workshop or conference publications on the topic by the authors are cited (some previous work on the topic at ESWC or ISWC workshops / conferences by the authors would give some credibility). A lot of papers about ontology matching and knowledge graph construction are cited, and the authors show deep knowledge of that field; about the topic of statistical linked data, this is less the case. Also, I noted that several similar papers by the same authors were repeatedly cited together (e.g., by Huang W, and by Chaves-Fraga D); this could be better balanced.
* I would suggest considering the following papers for your work as related work or foundation. As a co-author of the papers, I am biased. Still, I think they fit nicely with your research goal of easy integration into a unified view based on the QB vocabulary.
Bischof, S., Harth, A., Kämpgen, B., Polleres, A., & Schneider, P. (2018). Enriching integrated statistical open city data by combining equational knowledge and missing value imputation. Journal of Web Semantics, 48, 22–47. https://doi.org/10.1016/j.websem.2017.09.003
For instance, the paper defines a unified view as the basis for querying, and describes methods to convert measures from one unit to another, and to predict missing values.
Kämpgen, B., Stadtmüller, S., & Harth, A. (2014). Querying the Global Cube: Integration of Multidimensional Datasets from the Web. EKAW 2014, 250–265. https://doi.org/10.1007/978-3-319-13704-9_20
For instance, the paper defines a unified view (Global Cube) and describes different integration challenges such as different dimensions, different dimension names, different levels of detail, and different units of measurement.
It describes a method to convert measures (with declarative descriptions, from one unit to another or one or more measurements to another) and to merge two data cubes.
Kämpgen, B., & Harth, A. (2011). Transforming Statistical Linked Data for Use in OLAP Systems. I-Semantics 2011, (Mdm). http://www.aifb.kit.edu/web/Inproceedings3211
For instance, the paper includes reasoning over equivalent dimensions (based on owl:sameAs) to integrate one or more cubes.
* The cover letter by the authors contains information that is better placed directly in the paper, e.g., the "Summary of the manuscript’s key contributions" and the description of supplementary files in a GitLab repository.
* As the presented tool "CubeModeler" seems very mature, maybe consider publishing it in the category "Reports on tools and systems".