Review Comment:
Both the article and the dataset still need significant work in order to be publishable.
A big problem I have with the article is the haphazard nature of what information has been included and what has been left out. In particular, many aspects that would help people actually use the dataset are missing, as are discussions that would help in determining the relevance of the dataset to a particular task.
First, the statement on licensing should be worded better. As it stands, you first say that all RDF data is CC0, but in the very next sentence you say that this does not extend to the original metadata, which carry restrictions of their own. For a person to evaluate the possibility of using this dataset, these distinctions and their ramifications should be discussed in much more detail, going through all subsets and noting which licenses apply to which parts of the original data.
In addition to license metadata, more information on the actual contents of the subcollections would also be of use. For example, elsewhere it is stated that the quality of the metadata varies widely between the subcollections, with some making references to agent, place and subject authorities while others do not. For a potential user of the dataset, this is essential information, along with exactly which reference authorities are used, etc. It also seems that a lot of the data only has labels in German, which would be good to point out.
It would also help to have instance counts for each individual collection instead of just aggregate numbers. Also state upfront the number of total documents versus individual pages of those documents, instead of hiding this information by forcing the reader to subtract the count for individual pages from the total count of CHOs (if I counted correctly and adequately accounted for the duplication inherent in EDM, you should have some 83,000 documents in your repository?).
I would also like the article to contain concrete examples of what the data is good for, going for example through a search and browse session that highlights some interesting connections in the data.
The data model should also be described more concretely. First, the example associated with the data model should be moved much earlier in the section, and should be used as a focus to highlight the various aspects of the model. Then, more detail on the model itself should be provided.
For example, how does the model actually align with EDM? It is stated that the DM2E model is an application-specific specialization of EDM. However, later we learn that for ingestion into Europeana the data actually undergoes a transformation from the DM2E model to EDM, implying that the two are distinct. This really needs to be spelled out.
It would also be interesting to know from a modeling perspective what changes were made between e.g. the 1.1 and 1.2 versions of the model, and what caused them.
There are also design decisions in the model that I would not have made, and that I would therefore like to see rationalized. First, I would think the choice of which items to show is an application-specific one. Thus, putting the dm2e:displayLevel property on the same level as more neutral metadata strikes me as odd, particularly as this information seems to be something that can be directly inferred from the hierarchy. The same goes for the dm2e:levelOfHierarchy property, though here I can see the use of providing this shortcut as a means of lessening query-time complexity.
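To illustrate why I consider this information derivable: assuming the part-whole hierarchy is expressed with something like dcterms:isPartOf (the actual property in the model may differ), the hierarchy level of each item could be computed at query time with a SPARQL 1.1 property path rather than materialized:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
# Sketch only: derive each item's hierarchy level by counting its
# ancestors (dcterms:isPartOf is my assumption for the part-whole link).
SELECT ?item (COUNT(?ancestor) AS ?level)
WHERE {
  ?item dcterms:isPartOf+ ?ancestor .
}
GROUP BY ?item
```

I grant that precomputing this avoids the cost of the transitive path, which is why a stated rationale would suffice.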
The model also currently seems to contain a lot of other duplication that could just be inferred at query time, such as associating the organization and creator resources with each individual page of a manuscript in addition to the manuscript as a whole.
As I said, these are probably justified design choices, but I would just like to see the justifications.
Regarding access, at the time of writing this review, the SOLR search API did not function (though it did in earlier testing). As the LD site for the most part does not disclose incoming references to concepts, this severely limits an LD agent's ability to browse and search the site (e.g. I currently cannot find out which items refer to the concept Sagengestalt, because the RDF at http://data.dm2e.eu/data/concept/onb/authority_gnd/4221861-5 does not return this). Related to this, it would also be extremely beneficial if the data could be made available as a SPARQL service. At the very least, providing a data dump in addition to the current LD site would be a must.
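For concreteness, the kind of question I could not answer from the dereferenced RDF alone, but which a SPARQL service would trivially support, is the following (a sketch; I leave the predicate unconstrained since any incoming reference is of interest):

```sparql
# Sketch only: find all resources that refer to the concept
# "Sagengestalt" via any predicate.
SELECT ?item ?p
WHERE {
  ?item ?p <http://data.dm2e.eu/data/concept/onb/authority_gnd/4221861-5> .
}
```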
Regarding the integration with Pundit and Feed, a concrete example would help. I was unable to find any dm2e:hasAnnotatableVersionAt properties on the items I browsed. Similarly, with regard to statement-level provenance, the example given didn't actually contain what was described in the paper.
From the presentation, I could not parse out how provenance and versioning work as a whole. On exploration, it seems that the different VoID dataset versions record which items were produced in each run, along with batch details. However, the items appear to be recorded in these version datasets using their unversioned identifiers, so that in actuality even earlier version datasets always refer to the newest version of an item! I also did not find the versioning links between the resource maps that were advertised in the paper, which would have remedied this situation somewhat.
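To make the problem concrete, consider the following sketch (all IRIs hypothetical, and void:exampleResource standing in for whatever property the version datasets actually use to enumerate their items):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .

# Current situation (as I understand it): the 1.1 version dataset
# points at an unversioned item IRI, which resolves to the newest data.
<http://data.dm2e.eu/dataset/example/1.1>
    void:exampleResource <http://data.dm2e.eu/item/example/123> .

# What would actually pin the version: a versioned item IRI.
<http://data.dm2e.eu/dataset/example/1.1>
    void:exampleResource <http://data.dm2e.eu/item/example/123/1.1> .
```

Either versioned item identifiers or the advertised versioning links between resource maps would resolve this; currently I could find neither.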
Minor comments:
* In the introduction, a reference is made to "Scholarly Primitives" with an outside citation. If these are important enough to be mentioned, they should also be spelled out in this paper.