Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
The paper is original and presents significant results. The quality of writing could be improved, so I suggest acceptance with minor (editorial, non-scientific) revisions. A detailed review is given below.
The paper presents a method and a set of associated tools for transforming archival data expressed in ISAD(G) and ISAAR(CPF) into equivalent data expressed in the CIDOC CRM, with the aim of making the original data interoperable with a larger set of applications. This objective is very important because archives are fundamental information sources for knowing our past and better understanding our present. Making them more interoperable is therefore paramount, and the CIDOC CRM is an ideal choice since it is a well-known and widely used standard in the Cultural Heritage domain. The method and the tools presented in the paper succeed in realizing this objective, facing and solving several conceptual and technical problems. I therefore recommend acceptance.
On the other hand, the paper may prove difficult to read for the non-initiated. I therefore also recommend some revisions to the presentation, detailed below section by section.
Section 1 clearly sets the context, the objectives and the achievements of the paper.
Section 2 provides a detailed and well-argued state of the art.
Section 3 would benefit from some modification.
• Figure 1 is hard to read; I recommend enlarging the text in the boxes.
• In Section 3.2, the sentence “the description level of a unit is Fonds” is analyzed as follows: “description level” is akin to a class, “Fonds” is akin to an individual, and the whole sentence asserts membership of the individual “Fonds” in the class “description level”. In other words, the authors interpret the sentence above like the sentence “Fido is a dog”, which asserts membership of the individual “Fido” in the class “dog”. I believe this interpretation is flawed, because it treats the notion “description level” as an independent entity, whereas that notion depends on the notion named “unit”, which the interpretation ignores. More simply, the sentence above is akin to a sentence like “the birthplace of a person is Place”, which is a categorical assertion about universals and is typically interpreted in semantic modelling as follows: “Person” and “Place” are classes, “the birthplace of” is a property, and the sentence asserts that the domain of property “the birthplace of” is class “Person” and its range is class “Place”. Similarly, “the description level of a unit is Fonds” asserts that property “the description level of” has class “unit” as domain and class “Fonds” as range. Once “description level” is interpreted as a class, the damage is done, and it is not repaired by reverting to a type in place of a class, because the two notions are akin to each other and entirely different from the notion of property. The misinterpretation continues with the next sentence analyzed by the authors: “The description level of the unit with reference code 41 xxx is Fonds” (en passant, the NLP tagging of the sentence with PoS elements does not help in detecting the mistake, so perhaps it is not so important and can be omitted for the sake of brevity). Here, the obvious reading would be that the individual unit identified by code xxx has Fonds as description level.
The fundamental difference from the previous sentence is that this sentence is not categorical, but merely expresses factual knowledge about a specific unit, identified by means of another property, “with reference code”. So here we have (a) two classes, “unit” and “Fonds”; (b) two properties, “the description level of” and “with reference code”; and (c) two individuals, the unit which is the subject of the assertion and the code “xxx”. But of course, this is not the interpretation of the authors, who insist on “the description level of” being a class “Description_Level”. I urge the authors to fix these issues. They do not have any impact on the mapping to the CRM; the problem is purely rhetorical.
• Table 1 in Section 3.3 is very hard to read, and I wonder whether it would be more appropriate to present the rules in the form they are given in Appendix A, moving Table 1 to the Appendix. I would also move to the same Appendix the six commands that produce the transformed data and the presentation of the workflow with the examples. I understand the authors are very proud of these technical aspects of their work, but placing them in the middle of the paper may create a barrier for the reader, who first needs to understand the concepts, and then, possibly, the way these concepts have been implemented. Note also that the concepts are far more important than the technical implementation, because the latter may change due to many contingent matters, while the former are set once and for all.
• Another issue with Section 3.3 is the way the authors handle the so-called “.1 properties” of the CIDOC CRM, such as P1.1 (illustrated in Figures 4 to 6). It took me a few hours to find the CIDOC CRM recommendation for dealing with these properties that the authors refer to. I finally found it in a PowerPoint presentation on a page of the CIDOC CRM website (http://new.cidoc-crm.org/Resources/modeling-properties-of-properties-in-...). Please insert this reference in the paper, either as a footnote or as a reference proper.
• Related to the previous point, I did not find in the paper any statement of which version of the CIDOC CRM the authors have used. I believe they used version 6.2 and its RDF Schema expression, but this needs to be stated clearly. Note that the latest version of the CIDOC CRM is 7.1, which I believe is fully compatible with the one used by the authors, but they have to check this.
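To make the reading of the two Section 3.2 sentences suggested above concrete, here is a minimal sketch in Python, using plain tuples in place of RDF triples. It is only an illustration of the distinction between classes and properties, not a proposal for the authors' mapping. All names prefixed with ex: are hypothetical placeholders, not identifiers from the paper or from the CIDOC CRM, and the value node ex:level_1 is an assumption of the sketch.

```python
# Plain (subject, predicate, object) tuples stand in for RDF triples.
# Every "ex:" name is hypothetical; none comes from the paper or the CIDOC CRM.

# Categorical reading of "the description level of a unit is Fonds":
# the property has class Unit as domain and class Fonds as range.
SCHEMA = {
    ("ex:description_level_of", "rdfs:domain", "ex:Unit"),
    ("ex:description_level_of", "rdfs:range", "ex:Fonds"),
    ("ex:with_reference_code", "rdfs:domain", "ex:Unit"),
}

# Factual reading of "the description level of the unit with reference
# code xxx is Fonds": one specific unit, identified by its code.
# The value node ex:level_1 is an assumption of this sketch.
FACTS = {
    ("ex:unit_1", "rdf:type", "ex:Unit"),
    ("ex:unit_1", "ex:with_reference_code", "xxx"),
    ("ex:unit_1", "ex:description_level_of", "ex:level_1"),
    ("ex:level_1", "rdf:type", "ex:Fonds"),
}


def classes(triples):
    """Names used as classes: objects of rdf:type, rdfs:domain or rdfs:range."""
    class_predicates = ("rdf:type", "rdfs:domain", "rdfs:range")
    return {o for _, p, o in triples if p in class_predicates}


def properties(triples):
    """Names used as properties: subjects of rdfs:domain or rdfs:range."""
    return {s for s, p, _ in triples if p in ("rdfs:domain", "rdfs:range")}
```

Under this reading, classes(SCHEMA | FACTS) contains only ex:Unit and ex:Fonds, while ex:description_level_of appears only among the properties; it is never a class.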
Section 4 is fine, except for Table 5, which presents technical rules that are likely to be of interest only to implementors. Please consider moving this Table to an Appendix, as suggested for Table 1, leaving in this Section only a conceptual illustration. Another option is to drop the Table from the paper, presenting only one rule as an example and referring to a technical document for the complete set of rules.
Section 5 is fine, except that I found missing quotes (for instance on the right-hand side of query DLq1) and that the queries are not very readable in their present form. The authors should consider using indentation to highlight the structure of the queries, perhaps with a different font to save space, or presenting the queries in a Table, thus taking advantage of the full page width.
Section 6, in my opinion, can be made much shorter, if not removed altogether. The reason is that this Section mostly discusses quality issues of the data that the authors have been working with. These issues are not related to the strategy presented in the paper, and in any case they can be dealt with by using specific techniques, such as de-duplication or lexical data management. So I would urge the authors to leave these issues out, perhaps just mentioning them en passant.
Section 7 is fine.