Review Comment:
This manuscript was submitted as 'Application Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described application (convincing evidence must be provided). (2) Clarity and readability of the describing paper, which shall convey to the reader the key ideas regarding the application of Semantic Web technologies in the application.
-----------
The contribution describes the migration of a library catalogue into linked open data. The resulting data is primarily based on the RDF vocabulary for RDA (Resource Description and Access), but other vocabularies are also used. The emphasis in the article is on the migration process, the underlying conceptual model (IFLA’s FRBR model), and the vocabularies used in the final coding. The article is submitted as an “Application Report”, but it does not describe any particular application that uses this data, and should rather be considered a “Linked Dataset Description”. The project was presented earlier this year as a poster at TPDL 2015, accompanied by a short paper in the proceedings, but this contribution is sufficiently different and elaborated to justify a new publication.
The paper is well written, easy to read and reasonably well organized, and the figures are relevant and of good quality. The contribution is relevant for others who are looking for examples of how to migrate library catalogues to linked open data. The potential impact, however, is somewhat limited because the main focus is on the process, the software and the vocabularies used, rather than on the quality and reuse value of the resulting data set.
Given that this is a semantic web journal, I find the introduction a bit elementary; it appears to be written for readers without any prior knowledge of semantic web technologies. The statement about RDF being based on XML should be revised, because RDF is better presented as a graph-based data model independent of XML. The motivation for this work, however, is well described. The description of the RDF vocabulary developed for RDA is acceptable, but the authors could distinguish better between RDA as a standard for descriptive cataloguing and the RDF vocabulary developed for the elements and relationship designators in RDA. The authors argue that there is a need for data processing before bibliographic records can be published as linked open data, which of course is perfectly correct, but it is difficult to figure out what they mean by “encoded using heterogeneous library standards”.
In “Related work” the authors demonstrate knowledge of recent comparable initiatives and projects in the library domain, but I miss references to research on the process and problems of transforming library records to the FRBR model, and I get the impression that this research is largely ignored in this project. The transformation process is described merely from a technical point of view, but the main challenge is quality in terms of the semantic correctness of the result. Please check e.g. the paper by Decourselle from the TPDL 2015 proceedings, earlier papers by Aalberg et al. (this reviewer), Manguinas et al., M. Yee, as well as papers from the Variations2-project etc.
The relational database that is used internally to store the Biblioteca Virtual Miguel de Cervantes appears less interesting, and the presentation could benefit from a stronger focus on the final ontology instead. After all, it is the final output of triples that will be exposed to others. I also miss a better description of the logic applied in the transformation. The guidelines by LC that the authors refer to are primarily a mapping between properties, which typically has to be accompanied by some interpretation logic for identifying and relating the entities described in each record.
Results are presented in the final section, and the claim is that the procedure has been able to automatically transform a reasonable number of records “successfully”. The main problem in the results section is the lack of discussion of what exactly has been achieved. Quality is discussed merely from a syntactic point of view, by counting classes, properties, triples and entities that are linked to external collections (such as VIAF). A more in-depth discussion and analysis of the data in the context of its use is needed to show the actual quality and reuse value of this data. A simple query performed by this reviewer on the SPARQL endpoint for works having Cervantes as author returns a listing of 408 works (?), including numerous works with different URIs but titles that indicate equivalent entities (“El ingenioso hidalgo don Quijote de la Mancha” and variants of this title). This simple test indicates that the problem of deduplication and erroneously identified instances is ignored in this project, which was also my impression from reading the article. This reviewer may of course be wrong, but the paper does not give any evidence to the contrary. The result is a collection that implements the vocabularies of RDA and shows what can be done in terms of coding data for the semantic web, but it does not recognize and deal with the migration problems and quality issues that have been identified in previous research.
My main conclusion is that this is potentially a relevant contribution, but a major revision is needed before it can be accepted. First, the authors should include a discussion of the typical migration problems and quality issues that others have identified for transforming MARC data into FRBR. Secondly, they need to include some evidence demonstrating the actual quality of the final data, e.g. by looking at the results for known cases, counting duplicate as well as erroneously generated entities etc. Given that the proper migration of library data into richer semantic models such as FRBR coded using the RDA vocabulary is a very hard problem, I do not expect such data to show perfect results, but I do expect a thorough discussion of well-known migration challenges, the solutions the authors have implemented and evidence of the results they have achieved.
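To illustrate the kind of evidence I have in mind, a rough count of duplicate candidates among work entities could be sketched along the following lines. This is only a minimal sketch under stated assumptions: the URIs and titles are made-up placeholders rather than data from the collection, the function names are hypothetical, and grouping by a normalized title is my own simplification that will catch only the most obvious variants. It is not the authors' method.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key: no accents, case or punctuation."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace anything that is not a letter, digit or space with a space
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    # Collapse runs of whitespace
    return " ".join(cleaned.split())


def duplicate_candidates(works):
    """Group work URIs by normalized title; keep groups with more than one URI."""
    groups = defaultdict(set)
    for uri, title in works:
        groups[normalize_title(title)].add(uri)
    return {key: sorted(uris) for key, uris in groups.items() if len(uris) > 1}


# Hypothetical (URI, title) pairs, e.g. as exported from a SPARQL result set
sample_works = [
    ("http://example.org/work/1", "El ingenioso hidalgo don Quijote de la Mancha"),
    ("http://example.org/work/2", "El ingenioso hidalgo Don Quijote de la Mancha."),
    ("http://example.org/work/3", "Novelas ejemplares"),
]

candidates = duplicate_candidates(sample_works)
for key, uris in candidates.items():
    print(f"{len(uris)} URIs share the title key '{key}': {uris}")
```

Reporting the number and size of such groups for a few well-known works, together with a manual check of whether the grouped URIs really denote the same work, would already give readers a much better picture of the data quality than triple counts alone.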