Review Comment:
In general, I like the content of the article and think it deserves publication. However, I feel there is still work to be done in cleaning up and expanding the text, as well as in fixing bugs in the dataset and its homepage.
Going chronologically, the Related Work section of the introduction is puzzling. In my opinion, much of what is presented as related is really far from the actual content of the paper, while a lot that would be relevant is missing. I would suggest as related work 1) other projects and tools that do Excel/CSV-to-RDF conversion, and 2) other projects that publish data as RDF data cubes, or that publish similar data even if not as data cubes. (I do have to note, however, that most of the referenced works I find puzzling actually come from the suggestions of a prior reviewer, so I can understand your dilemma.) As a small note, there is a typo in "knwoledge graph".
Similarly, in section 2.1, it is stated that "well-known generic community tools" cannot be used. However, no actual references to such tools are given. Such a statement would also need at least a couple of sentences of argumentation in relation to those tools.
It is not easy to understand Listing 1 when not even an example dimension resource is included. I would also like the examples to cover the information associated with the different types of row and column header resources.
In the article, multiple links to GitHub resources no longer resolve due to refactorings. These should be changed to point to a particular tagged version so that this does not happen again.
Also, at present, several of the example queries on the dataset query page do not work. These should be fixed.
In the article, as well as on the dataset page, the term (and graph name) cedar-mini sometimes appears instead of cedar. The relationship between the two is not explained, which hampers use of the dataset.
For many of the tables in the article, the values depicted are not adequately explained. For example, what exactly is the frequency/% in Table 2 a portion of? What does "SPARQL" as a means of generation mean in Table 5? And so on.
In the article, links to outside data sources are mentioned in passing in several places, but not comprehensively discussed. Here, an additional table listing all outside link targets, as well as how these links were generated, would help.
Inside the dataset itself, some of the URIs are currently not dereferenceable (they contain a port number, :8888). Also, some OA resources are in my opinion unnecessarily blank nodes, which hinders their exploration.
The property cedar:isTotal is present in a table in the paper, but not discussed. However, judging from the example queries on the site, this seems to be an important property that distinguishes different types of observations from each other. Thus, its significance should be explained.
Finally, and most importantly, I do not think that the analysis of dataset (rule) coverage is adequate. For example, at present one does not really know how much of the information in the original data is transferred into the final representation. One would need at least percentages for the number of different codes handled thus far versus the total number of codes, as well as a qualitative evaluation of, for example, how many of the original dimensions have thus far been mostly mapped. The paper states that these questions are answered by the "full statistical analysis" on the dataset home page, but at least for me this was not the case. On that page, I only see the total number of sheets processed, with no deeper detail on how much of their information made its way into the end result.