Review Comment:
As suggested in my earlier review, the paper has now been submitted as a Dataset Description, a type better suited to this work. However, the authors need to pay very close attention to the call for papers for dataset descriptions. Several characteristics needed to use the dataset are still missing, e.g., spelling out its name, the URI, the version date and number, licensing, and availability. Evidence of its use is also still missing. Finally, while datasets in a government data repository are probably fairly safe in terms of long-term availability, a w3id identifier, a DOI, and availability on Figshare or Zenodo would be advisable.
Beyond advertising the dataset and reporting its importance, a dataset description paper should also convey lessons learned in generating the dataset. The paper still falls short in this respect. Two main issues:
- The transformation process is not described in detail and is not repeatable, i.e., someone with a similar problem cannot follow a methodology presented in this paper. There is some mention that the ontology was built manually, but how was the mapping done? It looks like it was all done with custom scripts. Why not use mapping tools such as the ones listed here: https://github.com/semantalytics/awesome-semantic-web. Potential tools for the process include any2rdf, triply, tarql, J2RM, or any23.
Also, to be of real value to others, these tools should be configured (with a wrapper built around them) to automate the process for the platform underlying the Nova Scotia Open Data portal, i.e., Socrata.
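To make the suggestion concrete: such a wrapper would only need to map the row dictionaries that Socrata's JSON export returns onto RDF triples. A minimal sketch is below; the column names, predicate URIs, and base namespace are illustrative assumptions, not the authors' actual vocabulary.

```python
# Hypothetical sketch: convert rows from a Socrata-style JSON export
# into simple N-Triples. All URIs and column names here are made up
# for illustration; a real wrapper would read them from a config.

def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')

def rows_to_ntriples(rows, base_uri, predicate_map):
    """Map each row (a dict, as returned by a Socrata JSON endpoint)
    to one subject, with one triple per mapped column."""
    triples = []
    for i, row in enumerate(rows):
        subject = f"<{base_uri}/record/{i}>"
        for column, predicate in predicate_map.items():
            if column in row:
                literal = escape_literal(row[column])
                triples.append(f'{subject} <{predicate}> "{literal}" .')
    return triples

# Example usage with made-up data and vocabulary:
rows = [{"disease": "Influenza", "count": "42"}]
mapping = {
    "disease": "http://example.org/vocab/diseaseName",
    "count": "http://example.org/vocab/caseCount",
}
print("\n".join(rows_to_ntriples(rows, "http://example.org/ns", mapping)))
```

A methodology section along these lines (endpoint, column-to-predicate mapping, triple generation) would make the transformation repeatable for anyone facing a similar Socrata-backed portal.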
- The level of human intervention is also not clear. The conclusion states that a tool was developed to retrieve open datasets, but that the identification of disease datasets was carried out manually. This is unclear, and the methodology section needs to state explicitly which parts are manual and which can be automated. For example, the ontology development was obviously manual, but the mapping to the ontology can be automated using the aforementioned tools.
Overall, the paper can be shortened in several sections to leave space for a clear description of the dataset, the methodology, and the ontologies developed. For example, the description of the data portal itself is too long. Table 1 is not needed, as these appear to be the datasets in their original format, at least as far as I can tell when I click on them. Figure 3 shows an example observation; it would be better to first present the ontology in detail and then show an example instance. Section 4.4 is redundant. The SWRL rules appear to be just a proof of concept: are they deployed and used? If not, they should not be included in the paper. Section 4.6 on the queries likewise looks like a proof of concept. Are these predefined SPARQL queries using SPIN or advanced SHACL features, and can they be used through the portal? If not, again, they should not be in the paper. Wikidata offers a SPARQL interface with predefined queries; this is one way of helping users access the data.
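For illustration, a "predefined query" in the Wikidata sense can be as simple as a parameterised template that the portal fills in for the user. The vocabulary URI and property names below are assumptions for the sake of the example, not taken from the paper under review.

```python
# Illustrative only: a canned, parameterised SPARQL query of the kind
# a portal could expose to non-expert users. The ex: vocabulary is a
# placeholder, not the authors' actual ontology.
from string import Template

DISEASE_QUERY = Template("""\
PREFIX ex: <http://example.org/vocab/>
SELECT ?dataset ?count WHERE {
  ?dataset ex:diseaseName "$disease" ;
           ex:caseCount ?count .
}""")

def build_disease_query(disease):
    """Fill the template, rejecting quote characters to avoid injection."""
    if '"' in disease or "\\" in disease:
        raise ValueError("invalid disease name")
    return DISEASE_QUERY.substitute(disease=disease)

print(build_disease_query("Influenza"))
```

If the paper's queries are only one-off demonstrations rather than something users can actually run through the portal in this fashion, they add little to a dataset description.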
There should also be a stronger focus on lessons learned for jurisdictions dealing with Linked Data. Several government Linked Data working groups globally (and the W3C) have published guidelines and recommendations on best practices. If these were followed, which aspects had to be changed or customised for the local context, and which ones were applicable as-is?
There are a few language issues in the paper that need to be addressed, e.g.,
"a multi-dimensional structure should be defined consists of measures, and dimensions describing the measures" misses a verb
"As a proof of concept, we designed a SWRL rule to infer the transitive relationship of diseases in a dataset using Protege rule engine"
"downing the road"
There are also some formatting errors such as ??? for Figures and references.
The reference list is also ill-formatted and not consistent. See FAQ10