Amsterdam Museum Linked Open Data
Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...
Revised manuscript after an accept pending major revisions, now accepted for publication. The reviews of the original submission are beneath the second round reviews.
Second round reviews:
Solicited review by Aba-Sah Dadzie:
The authors have done a good job of addressing the review comments. I'd recommend acceptance for the special issue, with a few minor corrections/additions. With regard to the specific requirements for this call:
* Quality of the dataset - this is well described and pointers to sample queries allow the reader to directly access the linked data.
* Usefulness (or potential usefulness) of the dataset - clearly contributes to the arts and cultural heritage. Further work planned by the authors indicates potential for further use and added value enabled by the conversion to LD.
* Clarity and completeness of the descriptions - the revised paper addresses concerns expressed in the 1st review. The use of existing standards and extensions to these are well described and referenced. The process followed in generating the dataset and the overall aims are also more clearly described.
* Name, URL, version date and number, licensing, availability, etc.
- Version information and licensing information are missing.
* Topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.
* Metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth.
* Examples and critical discussion of typical knowledge modeling patterns used.
- Addressed in sufficient detail
* Known shortcomings of the dataset. - fairly well addressed, but see point below.
_____________________
Additional points to address
p.2 - "Although this approach ensures a level of consistency and interoperability between the datasets from different institutions it creates a disconnect between the cultural heritage institute original metadata model and the Linked Data version."
This begs the question "why"? Also, this appears to be contradicted in the final paragraph in this section.
p.5 - " Finally 34 persons were linked to persons in DBpedia. This is a relatively low number as 1) most of the Amsterdam Museum people are not notorious enough to appear in DBPedia" - do you need to be "notorious" to appear in DBPedia? "Famous" or "noted", maybe, but notoriety is normally considered to be negative or at best not complimentary.
Language & Presentation
Generally well written but needs a spelling & grammar check and proofread for minor errors. Among others,
p.4 - "proxies-aggregation" -> "proxy-aggregation"
Formatting of URLs
Some of the URLs break because the formatting splits them and/or inserts whitespace at delimiters. The reader has to copy the full URL and delete the inserted whitespace to reach the intended address.
Solicited review by Philippe Cudre-Mauroux:
This second iteration corrects the minor flaws of the first version of the paper (which I already liked actually). From my perspective, this paper is ready for publication.
Solicited review by Fabien Gandon:
my previous comments were addressed.
First round reviews:
Solicited review by Fabien Gandon:
This paper presents the Amsterdam Museum Linked Open Data set
The access point, content, metrics, statistics, modeling rationale, etc. are provided by the paper, which is in my opinion a very good contribution to this CFP.
"with suffix 'proxy-', 'aggregation-', 't-' or 'p-' for proxies, aggregations, concepts and persons respectively (eg. am:proxy-22476."
Don't you mean prefix ?
"There are also 34 links to DBpedia."
Any reasons for such a low number?
Solicited review by Aba-Sah Dadzie:
The paper describes the generation of the Amsterdam Museum Linked Dataset, as part of the Europeana project, to make more accessible information about the collection and people related to the various objects and the museum as a whole. The Linked Data was created by editing a "crude" RDF dump, to, among others, ensure interoperability with other cultural heritage data and the Europeana Data Model.
A few examples of use of the source data are given, and the authors discuss the benefits that the conversion to Linked Data is expected to bring. A specific example is the creation of a mobile tour guide. An overview of the data structure and work to promote interoperability with domain-specific models and more general standards are detailed. The work reported includes web services for querying and URLs from which to browse the data. The authors note the need to periodically regenerate the dataset to capture changes in the source data.
I have a few reservations about this paper. While the authors provide a good amount of information about the dataset and the technology used to create it, it reads more like a project report that ticks off a list of deliverables than a description of the linked dataset and the design process followed - what it should be. I would suggest, where Europeana is first mentioned, that the authors give a brief description of the project as an introduction to its relation to the generation of the linked dataset (that it is a project the authors are involved with is not obvious till the end of the paper). This should make it easier to understand the impact on Europeana, based on lessons learnt from the process followed - this is to a large extent a pointer to potential reuse, further enrichment and maintenance of the Linked Data generated, that is wasted. Also, the authors conclude with future work on Europeana, not the Amsterdam Museum dataset.
A critical discussion of design choices and the knowledge modeling is missing. There is no comparison with related work - the four references are all self-citations pointing to more detail on specific aspects of the work reported. While I don't expect a detailed literature review in a short paper, obvious areas where a review could be carried out are the usage section, the description of the model used and how this relates to or is an improvement on other similar, domain-specific datasets (Linked Data or otherwise). I acknowledge other models and schemas are mentioned - but these simply refer to URLs to a project page or other rather than indicate why they constitute a good or balanced decision.
DETAILED REVIEW
p.1
"While larger cultural heritage institutions such as the German National Library or British National Library have the resources to produce their own Linked Data, metadata from smaller institutions is currently only being added through large-scale aggregators such as Europeana."
This statement is open to debate - I would amend it to something like " smaller institutions often depend on large-scale aggregators such as Europeana." AND back the claim with an appropriate citation.
"published it as "five-star" Linked Data" - "five-star" should be cited, using Berners-Lee's "Linked Data - Design Issues" article (http://www.w3.org/DesignIssues/LinkedData.html)
p.2
"The Amsterdam Museum Linked Data set implements best practices that -, the together with its methodology and tools- Europeana is keen on adopting for its future workflow."
Need evidence to back this claim. Also, is there something missing after the hyphen?
I don't understand the resource URI derivation. A suffix terminates a word; however the example given has "proxy" (the suffix) followed by a numerical code. A complete example might be useful here.
"We used purl.org URIs since for this conversion we were not in the position to use the Amsterdam Museum namespace for our Linked Data server." - why not?
It is not completely clear what the RDF relations and the conversion of the language information at the end of S2.1 are till later in the paper - forward-referencing (annotated sections of) Fig.2 would be useful here.
p.3
"In total the object metadata consists of 5,700,371 RDF triples of which many have a thesaurus concept or person resource as object." - How many is "many"? - the word is too vague to be meaningful.
"Two Amsterdam Museum classes am:Exhibition and am:Locat were defined as rdfs:subClassOf of the EDM class edm:Event."
Are these two classes particularly meaningful or are they simply meant as examples?
"Most term-based thesauri, including the AM thesaurus, have a more or less uniform structure (ISO 25964) making" - what does the ISO standard mean or refer to here?
p.5
"Linked Culture Data web" - is this referring to a particular initiative?
==============================
Figures & Tables
The caption of Fig. 1 is too long (especially compared to the figure content) - I would suggest working the description into the main text and providing a more concise caption.
Figure 2 is referenced in the text before Figure 1 - Figure 2 should be brought forward.
Figure 2 caption - "... with their super-properties and -classes in italics." Does "-classes" imply "SUPER-classes" - if so this must be explicitly written, a "-" works as a shortcut for suffixes, not prefixes.
Citations & Bibliography
Footnote 3 (URL) goes to an administrator login
Footnote 4 (URL) displays a tiny XML file (to do with diagnostics)
While http://purl.org/collections/nl/am IS redirected to http://semanticweb.cs.vu.nl/europeana, the resource itself is not found.
The 'Object Re-use and Exchange (ORE) model' should be cited properly, using whichever is most appropriate of the papers by Lagoze & Van de Sompel et al.
Language & Presentation
A small number of typos (not listed here) will be caught by an auto spell check.
All acronyms must be expanded at first use. This is especially important for those not in wide use. E.g.,
- OAI-PMH interface (p.2)
- RDA Group 2 metadata standard (p.2)
English uses "," as a 1000 delimiter - unlike some other languages which may use ".". Simply because this paper is in English it should use the "," convention. More importantly, usage should be consistent - this paper uses both. (p.2,4) This gets even more confusing on p.4 where "." is used in a sentence as a decimal point, and then also used as a 1000 delimiter.
p.3
"These properties are mapped to RDA Group 2 elements using 20 rdfs:subProperty relations were defined." -> "These properties are mapped to RDA Group 2 elements using 20 rdfs:subProperty relations."
p.5
"Where the current Linked Data pilot of Europeana (data.europeana.eu) focuses on producing a Linked Data set based on the already-ingested metadata consisting of a minimal set of Dublin Core properties."
This is not a sentence.
Solicited review by Philippe Cudre-Mauroux:
This short paper describes the Amsterdam Museum LOD. The paper starts by describing the modeling and conversion methodologies (basically, the metadata and vocabulary were exported from OAI-PMH as XML, converted to RDF, curated using rewriting rules, and finally interlinked). Mapping onto the Europeana Data Model (EDM) was carried out using subclasses and subproperties on the one hand, and a proxy-aggregation pattern (supporting EDM's multiplicity of providers by separating object metadata and provenance metadata) on the other hand. The resulting LOD is available online over HTTP (serving HTML/RDF-XML/Turtle by content negotiation) and as a GIT repository.
Overall, I liked the paper and found the description of the modeling/export process as well as the description of the data itself interesting and technically sound. The methodology used to produce the LOD is rather standard, but nevertheless compelling and gives a down-to-earth, pragmatic account of how to export local cultural data and map it to EDM. I inspected a number of entities online using the HTML interface and all the metadata I looked at appeared to me as correctly and precisely exported, hence supporting the claim of the authors (i.e., preserving the original richness of the data being LODified). The paper would in my opinion be stronger if it included i) more information about the linking process (e.g., high-level description of the Amalgame alignment platform, precision/recall results for the automatically created links) and ii) additional detail on the update procedure (how efficient/scalable is the overall export procedure? Is there any bottleneck? Would incremental updates be possible? etc.)
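The content negotiation mentioned in this review (one URI served as HTML, RDF/XML or Turtle) can be illustrated from the client side. The following is a minimal sketch only: the media types are the standard ones for these formats, while the helper name `negotiated_request` and the example URI (the base name joined to `proxy-22476` with a slash) are assumptions made for illustration, not details confirmed by the paper.

```python
from urllib.request import Request

# Standard media types for the three serializations named in the review.
FORMATS = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def negotiated_request(uri: str, fmt: str) -> Request:
    """Build (but do not send) an HTTP request asking for one serialization."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return Request(uri, headers={"Accept": FORMATS[fmt]})


# Hypothetical resource URI, used only to show the shape of the request.
req = negotiated_request("http://purl.org/collections/nl/am/proxy-22476", "turtle")
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it runs without network access.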
Comments
Response to reviewers
First of all, we would like to thank the reviewers for their very useful and concise comments. Below, we list our responses to the individual issues raised by the three reviewers. These also list the changes made in the resubmitted document.
================
REVIEWER 1
ISSUE: "with suffix 'proxy-', 'aggregation-', 't-' or 'p-' for proxies, aggregations, concepts and persons respectively (eg. am:proxy-22476." Don't you mean prefix ?
RESPONSE: This has been corrected and explained in more detail including a complete example.
ISSUE: "There are also 34 links to DBpedia." Any reasons for such a low number?
RESPONSE: Specified that the links are to persons in DBP and added the following text: "There are also 34 links to persons in DBpedia. This is a relatively low number as 1) most of the Amsterdam Museum people are not notorious enough to appear in DBPedia and 2) we used fairly simplistic matching algorithms here."
REVIEWER 2
ISSUE: I would suggest, where Europeana is first mentioned, that the authors give a brief description of the project as an introduction to its relation to the generation of the linked dataset (that it is a project the authors are involved with is not obvious till the end of the paper). This should make it easier to understand the impact on Europeana, based on lessons learnt from the process followed - this is to a large extent a pointer to potential reuse, further enrichment and maintenance of the Linked Data generated, that is wasted.
RESPONSE: We have added a paragraph in the Introduction explaining our relation to Europeana:
"A large part of the research that has resulted in the dataset described in this document was carried out within the context of EuropeanaConnect [http://www.europeanaconnect.eu/]. Within EuropeanaConnect, different core technologies and components for Europeana were developed, including the methodology and tools of which the Amsterdam Museum Linked Data set is the result. The dataset has been included in the Europeana Thoughtlab, a set of innovative technologies and tools that lead the way for Europeana-related developments[http://pro.europeana.eu/web/guest/thoughtlab/linked-open-data]."
Secondly, we have added the Europeana affiliation of one of the co-authors to further clarify our relation to Europeana.
ISSUE: Also, the authors conclude with future work on Europeana, not the Amsterdam Museum dataset.
RESPONSE: In the Discussion section, we have added a description of current work on the use of Amsterdam Museum LD which will inform a next version of the dataset.
"Future work on the data set includes efforts to produce more links, both to Amsterdam and Dutch cultural heritage datasets and vocabularies as well as to more general vocabularies such as VIAF or DBPedia. At the same time, we plan to validate our design choices by developing a number of Web and mobile applications that combine the Amsterdam Museum data with other datasets. One example is a mobile cultural tour guide for the city of Amsterdam in which the Amsterdam Museum Linked Data set will be a central datasource, which we are currently developing. In another project, we combine the Amsterdam Museum data with World War II archival and library data. We expect that these efforts will also provide us with feedback on the specific design choices made, which can inform next version."
Also, we have added a section on the location of the data:
"Currently, the Amsterdam Linked Data set is hosted on the ESL. We are looking at the feasibility of having the data hosted by the Amsterdam Museum itself, which could contribute to persistency and maintainability of the data. The use of PURL URIs allows us to redirect HTTP request to a second server."
ISSUE: A critical discussion of design choices and the knowledge modeling is missing. There is no comparison with related work - the four references are all self-citations pointing to more detail on specific aspects of the work reported. While I don't expect a detailed literature review in a short paper, obvious areas where a review could be carried out are the usage section, the description of the model used and how this relates to or is an improvement on other similar, domain-specific datasets (Linked Data or otherwise). I acknowledge other models and schemas are mentioned - but these simply refer to URLs to a project page or other rather than indicate why they constitute a good or balanced decision.
RESPONSE: We have added Section 2.2, elaborating on the design choices, specifically those made through our mapping of EDM. We have also added a number of references and citations to related models:
"Within the cultural heritage domain, a number of metadata schemas exist. Popular schemas for museums are models such as Dublin Core (DC)[5], Visual Resources Association (VRA)[http://www.vraweb.org/projects/vracore4/], the Lightweight Information Describing Objects schema (LIDO)[9] or the CIDOC Conceptual Reference Model (CRM) [3]. EDM is not built on any particular community standard but rather adopts an open, cross-domain Semantic Web-based framework that can accommodate the range and richness of these schemas. It has been tested for compatibility with other community standards such as the Encoded Archival Description (EAD)[http://www.loc.gov/ead] for archives and the Metadata Encoding and Transmission Standard (METS)[http://www.loc.gov/mets/] for digital libraries[4].
In fact, EDM mainly re-uses or draws inspiration from elements belonging to other standards. DC, CIDOC-CRM and SKOS are used for 'descriptive' metadata.[To a great extent, our interoperability approach is therefore easily transferable to application contexts that exploit other descriptive metadata element sets.] For person metadata, EDM uses the Resource Description and Access (RDA) Group 2 metadata standard [http://rdvocab.info/ElementsGr2]. These properties include given and family names, birth and death dates etc.[Europeana's choice for RDA was partly informed by the research presented in this paper.] For more 'technical' and 'organization-related' metadata aspects, requirements specific to large-scale aggregation and access to digitized resources have been taken into account, making EDM a fairly unique proposal as these scenarios emerged only recently. EDM for example supports multiple providers describing the same object and allows for enrichment of the museum data, while clearly showing the provenance of all the data that links to digital objects. This is achieved by incorporating the proxy-aggregation pattern from the Object Re-use and Exchange (ORE) model [Lagoze & Van de Sompel et al., 2008]. For more details and discussion on EDM we refer the reader to the paper on the data.europeana.eu dataset in this special issue [7]."
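The proxy-aggregation pattern described in the quoted paragraph can be sketched as three linking triples per object. This is an illustrative sketch, not the authors' conversion code: the property names `ore:proxyFor`, `ore:proxyIn` and `edm:aggregatedCHO` come from the ORE and EDM vocabularies, whereas the function name and the identifiers `am:object-22476` and `am:aggregation-22476` are hypothetical (only `am:proxy-22476` appears in the reviews).

```python
ORE = "http://www.openarchives.org/ore/terms/"
EDM = "http://www.europeana.eu/schemas/edm/"


def proxy_aggregation(obj: str, proxy: str, aggregation: str) -> list:
    """Return the triples that tie one provider's view of an object together.

    Descriptive metadata would hang off the proxy and provenance metadata
    off the aggregation, so several providers can describe the same object
    without their statements being mixed up.
    """
    return [
        (proxy, ORE + "proxyFor", obj),             # whose description this is
        (proxy, ORE + "proxyIn", aggregation),      # which provider's package
        (aggregation, EDM + "aggregatedCHO", obj),  # the object being packaged
    ]


# Hypothetical identifiers, written as compact names for readability.
triples = proxy_aggregation(
    "am:object-22476", "am:proxy-22476", "am:aggregation-22476"
)
for triple in triples:
    print(triple)
```

A second provider describing the same object would get its own proxy and aggregation, both pointing at the same object identifier.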
ISSUE: "While larger cultural heritage institutions such as the German National Library or British National Library have the resources to produce their own Linked Data, metadata from smaller institutions is currently only being added through large-scale aggregators such as Europeana." This statement is open to debate - I would amend it to something like " smaller institutions often depend on large-scale aggregators such as Europeana." AND back the claim with an appropriate citation.
RESPONSE: We have amended our paragraph as suggested and added some statistics regarding the number of metadata sets that are aggregated through Europeana. We feel this adequately conveys that most (smaller) institutions have their data published through an aggregator. We have also added a reference to the article by Isaac et al published in the same special issue reporting on this:
"While larger cultural heritage institutions such as the German National Library[https://wiki.dnb.de/display/LDS/] or British National Library[http://bnb.data.bl.uk] have the resources to produce their own Linked Data, smaller institutions often depend on large-scale aggregators such as Europeana. Europeana aggregates metadata from more than 2200 European cultural heritage institutions and provides access through its Web portal[http://www.europeana.eu]. The Europeana Linked Data pilot[http://data.europeana.eu] uses metadata from 200 of these institutions and provides Linked Open Data access, which is described in the data.europeana.eu paper in this special issue[7]."
ISSUE: "published it as "five-star" Linked Data" - "five-star" should be cited, using Berners-Lee's "Linked Data - Design Issues" article (http://www.w3.org/DesignIssues/LinkedData.html)
RESPONSE: We have added this citation.
ISSUE: "The Amsterdam Museum Linked Data set implements best practices that -, the together with its methodology and tools- Europeana is keen on adopting for its future workflow." Need evidence to back this claim. Also, is there something missing after the hyphen?
RESPONSE: We have clarified this through references to the Europeana Thoughtlab and the EDM Primer:
"The Amsterdam Museum Linked Data set implements best practices that, together with its methodology and tools, Europeana is keen on adopting for its future workflow. This is exemplified by the inclusion of the dataset in the Europeana Thoughtlab and by the adoption of specific modeling choices in newer versions of the Europeana Data Model primer (for example, the method of achieving interoperability as described in Section 2.3 is adopted in [6, Sect. 5.4])."
ISSUE: I don't understand the resource URI derivation. A suffix terminates a word; however the example given has "proxy" (the suffix) followed by a numerical code. A complete example might be useful here.
RESPONSE: This has been explained in more detail including a complete example
ISSUE: "We used purl.org URIs since for this conversion we were not in the position to use the Amsterdam Museum namespace for our Linked Data server." - why not?
RESPONSE: We have clarified this further in section 2.1: "At the time of conversion, we did not have access to the Amsterdam Museum web servers. This meant we could not have URIs in the Amsterdam Museum namespace redirect to our semantic server so that they can be resolved. We therefore employed the use of purl.org URIs." We have also added a short paragraph on this in the future work section: "Currently, the Amsterdam Linked Data set is hosted on the ESL. We are looking at the feasibility of having the data hosted by the Amsterdam Museum itself, which could contribute to persistency and maintainability of the data. The use of PURL URIs allows us to redirect HTTP request to a second server. "
ISSUE: It is not completely clear what the RDF relations and the conversion of the language information at the end of S2.1 are till later in the paper - forward-referencing (annotated sections of) Fig.2 would be useful here.
RESPONSE: We have added a forward reference, both to section 3.4 as well as Figure 2.
ISSUE: "In total the object metadata consists of 5,700,371 RDF triples of which many have a thesaurus concept or person resource as object." - How many is "many"? - the word is too vague to be meaningful.
RESPONSE: This has been specified to: "975,859 triples have a thesaurus concept as object and 210,407 triples have a person resource as object."
ISSUE: "Two Amsterdam Museum classes am:Exhibition and am:Locat were defined as rdfs:subClassOf of the EDM class edm:Event."
Are these two classes particularly meaningful or are they simply meant as examples?
RESPONSE: These class mappings are not specifically more meaningful than other information. The sentence was added since it makes the description of the RDFS schema file complete. For clarification we added: "Instances of these two classes are used to describe Amsterdam Museum-specific events related to the cultural heritage object."
ISSUE: "Most term-based thesauri, including the AM thesaurus, have a more or less uniform structure (ISO 25964) making" - what does the ISO standard mean or refer to here?
RESPONSE: We elaborate on the ISO standard (reverted to 2788 for clarity) and refer to van Assem et al for the conversion to SKOS: "Most term-based thesauri, including the AM thesaurus, have a more or less uniform structure, based on the ISO 2788 standard[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?c.... This standard defines two types of terms (preferred and non-preferred) and five relations between terms: broader, narrower, related, use and use for. Use and use for are allowed between preferred and non-preferred terms, the others only between preferred terms. Translating thesauri from this format to SKOS is fairly straightforward and well documented (cf. [10]). The Amsterdam Museum Thesaurus constructs are mapped to SKOS."
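The thesaurus-to-SKOS translation outlined in this response can be sketched as follows. The sketch assumes the commonly used mapping (preferred term to `skos:prefLabel`, non-preferred "use for" term to `skos:altLabel`, and broader/narrower/related to the SKOS relations of the same name). The record layout, the function name and the `example.org` base URI are hypothetical and do not reflect the authors' actual rewriting rules.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Relations that hold only between preferred terms.
RELATIONS = ("broader", "narrower", "related")


def thesaurus_to_skos(terms, base):
    """Convert preferred-term records to (subject, predicate, object) triples.

    Each record is a dict with hypothetical keys: 'id', 'label', an optional
    'use_for' list of non-preferred labels, and optional 'broader',
    'narrower' and 'related' lists holding ids of other preferred terms.
    """
    triples = []
    for term in terms:
        subject = base + term["id"]
        triples.append((subject, RDF_TYPE, SKOS + "Concept"))
        triples.append((subject, SKOS + "prefLabel", term["label"]))
        # Non-preferred terms become alternative labels, not concepts;
        # the inverse "use" relation is therefore implicit.
        for label in term.get("use_for", []):
            triples.append((subject, SKOS + "altLabel", label))
        for relation in RELATIONS:
            for target in term.get(relation, []):
                triples.append((subject, SKOS + relation, base + target))
    return triples


# Made-up sample records, for illustration only.
sample = [
    {"id": "t-1", "label": "painting", "use_for": ["picture"], "broader": ["t-2"]},
    {"id": "t-2", "label": "artwork", "narrower": ["t-1"]},
]
skos_triples = thesaurus_to_skos(sample, "http://example.org/thesaurus/")
for triple in skos_triples:
    print(triple)
```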
ISSUE: p.5 "Linked Culture Data web" - is this referring to a particular initiative?
RESPONSE: This does not refer to a specific initiative. To clarify, we have rephrased this to "...and we expect that the Linked Data version will be an equally central data set for the web of cultural heritage Linked Data." Also, the subsequent new future work paragraph specifies our current efforts in making this a reality.
==============================
Figures & Tables
ISSUE: The caption of Fig. 1 is too long (especially compared to the figure content) - I would suggest working the description into the main text and providing a more concise caption.
RESPONSE: we have moved the description into the main text.
ISSUE: Figure 2 is referenced in the text before Figure 1 - Figure 2 should be brought forward.
RESPONSE: This has been fixed
ISSUE: Figure 2 caption - "... with their super-properties and -classes in italics." Does "-classes" imply "SUPER-classes" - if so this must be explicitly written, a "-" works as a shortcut for suffixes, not prefixes.
RESPONSE: This has been fixed
ISSUE: Footnote 3 (URL) goes to an administrator login
RESPONSE: This has been fixed; the URL was changed to http://collectie.amsterdammuseum.nl/.
ISSUE: Footnote 4 (URL) displays a tiny XML file (to do with diagnostics)
RESPONSE: This footnote now gives an example URI rather than the root URI: "http://amdata.adlibsoft.com/wwwopac.ashx?database=AMcollect&search=creat... "
ISSUE: While http://purl.org/collections/nl/am IS redirected to http://semanticweb.cs.vu.nl/europeana, the resource itself is not found.
RESPONSE: http://purl.org/collections/nl/am is not a URI of a resource, but the basename used to build up AM resource URIs. As such, it should not be resolvable (although it is redirected by PURL).
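How resource URIs are built from this base name and the prefixes discussed earlier ('proxy-', 'aggregation-', 't-', 'p-') can be sketched as below. The base name, the four prefixes and the example `am:proxy-22476` come from the reviews and responses; joining base and local name with a slash, and the function name `am_uri`, are assumptions made for this illustration.

```python
AM_BASE = "http://purl.org/collections/nl/am"

# Prefix of the local name for each kind of resource, as listed in the reviews.
PREFIXES = {
    "proxy": "proxy-",
    "aggregation": "aggregation-",
    "concept": "t-",
    "person": "p-",
}


def am_uri(kind: str, identifier: str) -> str:
    """Expand a kind and identifier into a full resource URI.

    Assumption: base name and local name are joined with "/".
    """
    if kind not in PREFIXES:
        raise ValueError(f"unknown resource kind: {kind!r}")
    return f"{AM_BASE}/{PREFIXES[kind]}{identifier}"


print(am_uri("proxy", "22476"))
```

Under this reading, the base name itself identifies no resource, which is consistent with the response above: only the expanded URIs are expected to resolve.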
ISSUE: The 'Object Re-use and Exchange (ORE) model' should be cited properly, using whichever is most appropriate of the papers by Lagoze & Van de Sompel et al.
RESPONSE: Added citation to Lagoze & Van de Sompel et al (2008)
ISSUE: A small number of typos (not listed here) will be caught by an auto spell check.
RESPONSE: we ran a spell check and fixed the found typos
ISSUE: All acronyms must be expanded at first use. This is especially important for those not in wide use. E.g.,
- OAI-PMH interface (p.2)
- RDA Group 2 metadata standard (p.2)
RESPONSE: We expanded these acronyms as well as SKOS (sec 2.2) and VIAF (p.5). (Semantic) Web technological acronyms (HTTP, URI etc) have not been expanded.
ISSUE: English uses "," as a 1000 delimiter - unlike some other languages which may use ".". Simply because this paper is in English it should use the "," convention. More importantly, usage should be consistent - this paper uses both. (p.2,4) This gets even more confusing on p.4 where "." is used in a sentence as a decimal point, and then also used as a 1000 delimiter.
RESPONSE: this has been fixed
ISSUE: "These properties are mapped to RDA Group 2 elements using 20 rdfs:subProperty relations were defined." -> "These properties are mapped to RDA Group 2 elements using 20 rdfs:subProperty relations."
RESPONSE: this has been fixed
ISSUE: "Where the current Linked Data pilot of Europeana (data.europeana.eu) focuses on producing a Linked Data set based on the already-ingested metadata consisting of a minimal set of Dublin Core properties." This is not a sentence.
RESPONSE: this sentence has been fixed
REVIEWER 3 Solicited review by Philippe Cudre-Mauroux:
ISSUE: The paper would in my opinion be stronger if it included i) more information about the linking process (e.g., high-level description of the Amalgame alignment platform, precision/recall results for the automatically created links)
RESPONSE: To section 3.5, we have added a high-level description of the Amalgame alignment platform and descriptions of the actual alignment workflows used to obtain the mappings. These workflows included the results of manual evaluations, which give indications of the precision of the produced links.
ISSUE: additional detail on the update procedure (how efficient/scalable is the overall export procedure? Is there any bottleneck? Would incremental updates be possible? etc.)
RESPONSE: We have expanded the Discussion section to address the efficiency and scalability of the entire conversion workflow as well as the issue of incremental updates.