Review Comment:
The paper describes an approach to publishing statistical data under Linked Data principles, exposing a set of 5-star datasets. The authors motivate the work by the nature of the data published by governments. They also review some vocabularies, previous approaches, and key points to consider when publishing and linking data. Essentially, the authors address the problem of linking data so that new facts can be inferred and semantics added to each data item. Finally, the authors apply the methodology to a domain, a disease database, producing datasets that conform to the RDF Data Cube vocabulary. Furthermore, they define some quality metrics to ensure the data has been properly transformed. The authors also offer an interesting discussion, drawing conclusions and outlining future research lines.
In general, the paper is interesting and presents a good application of Linked Data to a specific context: statistical data. It is certainly true that many institutions mainly publish statistical data, and the need to consume such information under a common, unified data model is becoming critical for building added-value services. However, the approach should be improved in the following respects:
-State of the art. Since the main contribution of the paper is a methodology to publish statistical data, it is necessary to review linked data lifecycles and the stages they comprise. Furthermore, as the authors themselves note, statistical data requires a data model (combining different types of vocabularies, etc.), so a review of vocabulary typologies, linked data patterns, and the need for semantics for different purposes (e.g. data consistency) is required. The state of the art should be extended beyond similar applications to cover the linked data lifecycle, data modelling (especially for statistical data), and quality features for this type of data.
-Methodology and concept. Following on from the previous comment, there are many linked data lifecycles and methodologies. The process the authors introduce should build on existing lifecycles (extending or tailoring them) or instantiate one of them. Furthermore, the transformation process is too descriptive and does not enter into the details that matter when modelling statistical data: type of data, structure, etc. In this context, it is also relevant to consider how to deal with the issues that arise when processing this type of data: resolving missing values (where possible), ensuring correct domains and ranges, keeping consistency with the sources, handling updates, etc. In general, this section requires more detail and more concrete tasks to constitute a framework for transforming statistical data in a systematic way; otherwise, it is difficult to see the contribution with respect to other approaches that have published and linked data in other domains. From a technical perspective, entity reconciliation (matching and linking) is a cornerstone that requires details on the algorithms to be used (e.g. from classical approaches such as the PROMPT algorithm or reconciliation frameworks, to methods based on recent advances in embedding representations). In terms of publishing, it is strictly necessary to explain how to organize the datasets, slices, and observations, and how to model the data structure definitions. A justification for the selected vocabularies (apart from the RDF Data Cube) is also necessary.
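To illustrate the level of detail the reconciliation step should reach, the sketch below links source labels to candidate entity URIs with a simple string-similarity threshold. All names, URIs, and the threshold are hypothetical; a real pipeline would rather use a dedicated reconciliation framework or embedding-based similarity, which is exactly the choice the paper should justify:

```python
from difflib import SequenceMatcher

def reconcile(source_labels, candidate_entities, threshold=0.85):
    """Link each source label to the best-matching candidate entity URI.

    candidate_entities: dict mapping entity URI -> preferred label.
    Returns a dict from source label to URI, keeping only matches
    whose similarity score reaches the threshold.
    """
    links = {}
    for label in source_labels:
        best_uri, best_score = None, 0.0
        for uri, cand_label in candidate_entities.items():
            score = SequenceMatcher(None, label.lower(), cand_label.lower()).ratio()
            if score > best_score:
                best_uri, best_score = uri, score
        if best_score >= threshold:
            links[label] = best_uri
    return links

# Hypothetical example: linking disease names to DBpedia-style URIs
candidates = {
    "http://dbpedia.org/resource/Influenza": "Influenza",
    "http://dbpedia.org/resource/Malaria": "Malaria",
}
print(reconcile(["influenza", "dengue"], candidates))
```

The point of such a sketch is that every design decision (similarity measure, threshold, tie-breaking, what happens to unmatched labels) becomes explicit and reviewable.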
-Application and results. The authors apply the previous process to a specific domain, disease information, implementing with different technologies the stages to gather, transform, link, and publish data. They also present an ontology and some rules to infer facts and check consistency. However, this case study is not properly documented: it should include objectives and the implementation of the methodology, that is, a description of the data (its typology), a description of the ontology (how it is defined, its structure, etc.), a description of the transformation (problems found), details on the quality metrics (not just the indicators), and an account of how the quality checking process is performed.
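To make the distinction between an indicator and a metric concrete, a completeness metric over a set of observations could be specified as precisely as in this sketch (the observation structure and property names are hypothetical, chosen only to illustrate the expected level of detail):

```python
def completeness(observations, required_properties):
    """Fraction of observations carrying a non-null value for every required property."""
    if not observations:
        return 0.0
    complete = sum(
        1 for obs in observations
        if all(obs.get(p) is not None for p in required_properties)
    )
    return complete / len(observations)

# Hypothetical observations from a disease dataset:
# the second one is incomplete (missing case count).
obs = [
    {"disease": "Influenza", "year": 2012, "cases": 1500},
    {"disease": "Malaria", "year": 2012, "cases": None},
]
print(completeness(obs, ["disease", "year", "cases"]))  # 0.5
```

Stating each metric as an explicit computation over the transformed data, rather than as a named indicator, is what would let readers reproduce the quality checking process.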
As a final comment, the introduction and motivation of the paper are good, but the state of the art, the conceptual approach (methodology), the quality checking process (indicators, metrics, and implementation), and the case study require more detail to provide a systematic way of gathering, linking, structuring, and publishing statistical data.
Other comments:
-The abstract is correct, but it should include more detail on the results and main contributions: a methodology, and an application with concrete results (N datasets, etc.).
-The structure of the paper is ok.
-“rdf data cube”: fix the capitalization at its first appearance and add the proper citation.
-What do you mean by RDF multidimensional models? Should it not be RDF multidomain/cross-domain models?
-The references are relevant to the paper's content but, as commented above, some are missing on linked data lifecycles, data modelling, linked data patterns, linked data quality, etc.