The AGROVOC Linked Dataset

Paper Title: 
The AGROVOC Linked Dataset
Authors: 
Caterina Caracciolo, Armando Stellato, Ahsan Morshed, Gudrun Johannsen, Sachit Rajbahndari, Yves Jaques, Johannes Keizer
Abstract: 
Born in the early eighties as a multilingual authority file of agricultural index terms, AGROVOC has steadily evolved these last thirty years, moving to an electronic version around the year 2000 and shortly thereafter embracing the Semantic Web. Today AGROVOC is a SKOS-XL concept scheme published as Linked Open Data cloud, containing links (as well as backlinks) and references to many other Linked Datasets in the LOD cloud. In this paper we provide a brief historical summary of AGROVOC and detail its specification as a Linked Dataset.
Submission type: 
Dataset Description
Decision/Status: 
Accept
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised resubmission after an "accept pending major revisions", then "accept pending minor revisions", then "accept". First round reviews are beneath the second round reviews, which are beneath the third round review.

Solicited review by anonymous reviewer:

1. Reference [5] is still missing institution name and report type.

2. No need to mention the current implementation issues, if the authors are sure that they will be really solved before the publication of the paper. Otherwise, the paper should include at least a general statement, informing the reader that work is underway to fix a number of implementation issues concerning the publication of the AGROVOC thesaurus according to Linked Data principles.

Second round reviews:

Solicited review by Marta Sabou:

The paper has been extended to address all my comments and suggestions, and therefore it should be accepted as is. Given the various details included, I think that the extra page is well worth it.

Minor issue: The caption of Figure 2 should be capitalized.

Solicited review by Willem van Hage:

The paper still lacks concrete descriptions, like by means of a few examples, of how terms,
their properties, and links to other vocabularies are modeled in practice.
The paper would benefit from a more concrete description of modeling issues encountered.
Regardless of that, I think the paper is good enough for publication.

Solicited review by anonymous reviewer:

The new version of the paper addresses most of the issues pointed out in my previous review, with a few exceptions, which I discuss below:

1. Versioning support. It is still unclear whether versioning support for AGROVOC is actually planned, or it is something that is not going to be provided in the near future.

2. References. References [5] and [10] are still incomplete (missing title).

3. I would like to thank the authors for the feedback they provided on the implementation issues. However, I would suggest the authors briefly acknowledge that some work is underway to fix some issues in the implementation of AGROVOC according to Linked Data principles. Such a statement might be included at the end of Section 4.

First round reviews:

Solicited review by Marta Sabou:

This submission describes the AGROVOC Linked Dataset and provides interesting insights into publishing thesauri as well as into the characteristics of the publishing process when the dataset in question has been used for decades and is part of several legacy systems. Besides the value of the dataset itself, these lessons are of great interest to other thesauri managers and the LD community at large. I now continue with my rating of the paper along the criteria stated in the Special Call for Linked Dataset descriptions.

Quality of the dataset
High. The quality of the AGROVOC dataset itself is clear since this thesaurus has been developed, maintained and used over the last 3 decades by FAO and other organizations. As a LOD dataset, AGROVOC is impressive in terms of (1) its size (30K+ concepts); (2) level of multilinguality (covering over 20 languages) as well as (3) the level of interlinking with other similarly large-scale data sources (13 sources). As such, AGROVOC is an important addition to the LOD cloud. The dataset is available online for both human (through Pubby) and machine (SPARQL endpoint) access.

Usefulness (or potential usefulness) of the dataset
High. AGROVOC has been extensively used even before being published as LOD, and the paper describes novel usage scenarios of the LOD version, in particular within data.fao.org. Sections 5.2 and 5.3 describe other potential usage scenarios, but it is not clear whether these scenarios are made possible by the transition to LOD or can be realized without LOD technology. A clarification is therefore needed.

Clarity and completeness of the descriptions.
Medium. While the paper provides a good overall description of the dataset it could be improved with several details, especially those stated in the CfP. For example, the URL where the dataset is accessible is hidden as a footnote towards the end of the paper, while it should be stated clearly, very early in the paper. More details about versioning and licensing would also be useful. There is also very little about the coverage of the dataset itself besides mentioning the main domains it covers - this aspect should be revised and more insights should be given into the coverage of the source. Finally, I found that Section 2 describing VocBench (including Figure 1) had little relevance for this paper and I suggest that it should be substantially reduced to the aspects that are relevant to LOD publishing.

Minor comments and typos:
* abstract: "published as Linked Open Data cloud" => "published as Linked Open Data"

* section 1:
** "used in indexing" => "used for indexing"
** "AGROVOC maintenance" => "AGROVOC's maintenance"

* section 3:
** when table 1 is described, the statement about column 4 is confusing as the fact that an expert evaluates matches is only stated in the next paragraph.
** provide a brief summary of what the "Open Sense per Domain hypothesis" is, as not all readers of this journal might be familiar with it.

* section 5: "Semantic Web and publish on" => "Semantic Web and to be published on"

Solicited review by Willem van Hage:

This paper describes the process by which the AGROVOC thesaurus was released as linked data. It is important that a paper such as this is accepted, because this is a very important hub dataset in the linked data cloud pertaining to (the overlap between) agriculture, food safety, development aid, environmental sciences, ecology, and related fields of applied science. The paper does a decent job describing the development process and the external (mostly non-scientific) factors that influenced the design of this process.
The paper lacks concrete descriptions, like by means of a few examples, of how terms, their properties, and links to other vocabularies are modeled in practice. That is, I would like to know what the SKOS-XL, OWL, and the RDB schema implementation look like in close up.
The paper would benefit from a more concrete description of modeling issues encountered. For example, the experiences during the development of this data set played a part in the development of SKOS-XL. It would be very interesting to know what kind of things are necessary to model for common applications of AGROVOC (of which I know there are many) for which regular SKOS is insufficient.
In summary, this paper would be much improved by adding concrete examples throughout the paper.

Solicited review by anonymous reviewer:

SUMMARY
=======

The paper describes the maintenance and publishing process of the AGROVOC vocabulary. The authors present such vocabulary as a linked dataset since it is currently modelled as a SKOS-XL multilingual thesaurus, and mapped to a number of existing vocabularies, thesauri, and glossaries.

CONTENT
=======

The work concerning AGROVOC and reported in the paper is definitely relevant and interesting, since it covers key issues concerning the definition and maintenance of thesauri - namely, recording provenance information for the defined terms, and creating mappings with other thesauri.

It is however not clear whether versioning support is envisaged or not - this is a key feature to support backward / forward semantic interoperability, but it requires internal mappings between terms from different versions of the same thesaurus. I suggest the authors revise the relevant section ("Conclusions") to make this point clear.

Also, authors should explicitly state, possibly already in the introduction, where and how the AGROVOC linked dataset is available, as requested in the call for papers. Actually, from the website ( http://aims.fao.org/standards/agrovoc/linked-open-data ) you can find all the required information to access the dataset, and more. But this is not said in the paper, and such URL is in a footnote.

LANGUAGE AND PRESENTATION
=========================

Language and organisation of the paper are overall good, but the quality is not the same in all the sections. In particular, authors should totally revise Section 4.1. Besides the low quality of the language, it repeats what was said earlier. Also, the reference and full name of the VoID vocabulary should not be included here, but where the vocabulary is first mentioned in the paper - i.e., in the paragraph just before Section 4.1. Authors should consider merging this section with the parent one - this is the only subsection of Section 4.

Finally, authors should take care of references (quite a few of them are incomplete).

IMPLEMENTATION CONCERNS
=======================

I have some concerns about the actual implementation of AGROVOC according to Linked Data principles, which I list below. Of course, these are technical issues which can be easily fixed, but, if persisting, they should be acknowledged and motivated in the paper.

(A) Notably, a VoID description of AGROVOC is available ( http://aims.fao.org/aos/agrovoc/void.ttl ). This allows having a summary of the available mappings, the SPARQL endpoint, etc. However, it could be enriched with information concerning the used vocabularies - see VOAF ( http://labs.mondeca.com/vocab/voaf/ ) and LOV ( http://labs.mondeca.com/dataset/lov/ ).

(B) Not clear which is the URI space used to tell apart AGROVOC concepts and their descriptions. E.g., URI http://aims.fao.org/aos/agrovoc/c_12332 denotes concept c_12332, but it is also the URI of its description, resolved to one of the available representations (HTML, RDF/XML, Turtle) based on HTTP content negotiation. However, the RDF/XML and Turtle descriptions of concept c_12332 are also available at http://aims.fao.org/aos/agrovoc/data/c_12332 , whereas the corresponding HTML representation is also available at http://aims.fao.org/aos/agrovoc/page/c_12332 . As far as I know, no HTTP 303 redirection from the URI of the concept to the URI of its description is implemented.

(C) Always taking as an example concept c_12332 - in the corresponding RDF/XML document, the description of such concept is given as URI http://aims.fao.org/aos/data/c_12332?output=xml . Similarly, the corresponding Turtle document uses URI http://aims.fao.org/aos/data/c_12332?output=ttl . This is not completely correct. Such URIs are rather the URLs of the available distributions of such description (i.e., its RDF representation, serialised either as RDF/XML or Turtle). In other words, the URI of the description of a given concept should not point to any of its distributions, which can be however included into the concept description itself. E.g.:

Description of c_12332

RDF/XML document about concept c_12332

Turtle document about concept c_12332

...

...

(D) The RDF/XML description of concepts makes use of namespace prefixes like j.0, j.1, … for well-known vocabularies (e.g., SKOS, SKOS-XL) instead of using the most usual ones.

(E) The HTML pages of concepts include an "alternate" LINK tag, saying to point to an RDF/XML which is instead Turtle, and served as text/rdf+n3.

(F) At the bottom of the HTML page, the "As Turtle" link points to a Turtle file, served as text/rdf+n3.

(G) Note that media type text/rdf+n3 is incorrect. The correct and registered media type for N3 is text/n3 - see http://www.iana.org/assignments/media-types/text/n3 .
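
Points (B) and (E) to (G) can be illustrated together. The following is a minimal Python sketch, not the configuration of the deployed Pubby installation: the function name negotiate, the MEDIA_TYPES table and the simplified handling of the Accept header (q-values are ignored) are assumptions made for illustration only. It answers a request on a concept URI with an HTTP 303 redirect to the data or page URL mentioned in (B), and labels the response with a registered media type. Using text/turtle for the Turtle serialisation is an addition here; the review above only names text/n3 as the registered type for N3.

```python
# Illustrative sketch only; it does not reproduce the deployed AGROVOC server.
BASE = "http://aims.fao.org/aos/agrovoc/"

# Registered media types mapped to the URI space that serves them,
# following the data/ and page/ URLs quoted in point (B).
MEDIA_TYPES = {
    "application/rdf+xml": "data",
    "text/turtle": "data",
    "text/html": "page",
}


def negotiate(concept_id, accept_header):
    """Return (status, location, content_type) for a request on a concept URI.

    Assumption: media ranges are tried in the order listed and q-values are
    ignored; unknown or unregistered types fall back to the HTML page.
    """
    requested = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in requested:
        if media_type in MEDIA_TYPES:
            break
    else:
        media_type = "text/html"
    location = f"{BASE}{MEDIA_TYPES[media_type]}/{concept_id}"
    # 303 See Other: the concept URI names the concept, not a document about it.
    return 303, location, media_type
```

With this sketch, a client asking for Turtle is redirected to the data URL, while a browser (or a client sending the unregistered text/rdf+n3) is redirected to the HTML page.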

Comments

Thanks to all reviewers for the very constructive reviews and helpful suggestions.

We resubmitted a new version of the paper, taking into account all suggestions and trying to balance requests for more details with space availability. We have exceeded the limit by one page (7 pages now), even though we removed the old section on VocBench, as Marta Sabou suggested, and replaced it with a new, wider section giving many more details and statistical information about the dataset.

Thanks also to the anonymous reviewer for the "implementation concerns" section of their review, as, specifically for this kind of call (datasets, tools etc..), it is not only the paper which is evaluated, but the significance and soundness of the resource.

Regarding (A): we have added a VOAF description, pointing to the AGROVOC Dataset, inside the new vocabulary that has recently been separated from the AGROVOC Dataset (the Agrontology: http://aims.fao.org/aos/agrontology, which is now mentioned in the paper too). These changes have already been implemented and will be uploaded soon.

Regarding (B): we thought that we could maintain both the single URL resolved through content negotiation and the data and page URLs, the latter as "alternative" URLs for the representations of the concept, with the former remaining the main URI. We will investigate this.

Regarding (C) to (G): these are all things we need to change in our remote installation of Pubby (some of them were already known). It will take some time and will involve our hosting partners, but we will address them as soon as possible.

This is how we addressed Willem's comments:

The paper still lacks concrete descriptions, like by means of a few examples, of how terms, their properties, and links to other vocabularies are modeled in practice.
The paper would benefit from a more concrete description of modeling issues encountered.

In Sec. 2, we specified how URIs, terms and properties are modeled with skos-xl; as an example we used the concept "maize". We point out that, for the sake of human readability, terms are available both as skos-xl and as skos labels. A practical example is also given of how skos-xl is exploited to model editorial information such as elements' dates of creation and update. We added a paragraph commenting on the need to verify the modeling style currently adopted in various topic areas of AGROVOC: this activity is expected to result in an update of the vocabulary underlying AGROVOC, called "agrontology".
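
As a rough illustration of the modelling summarised above, the following minimal Python sketch lists the triples that relate a concept to a skos-xl label carrying its own editorial metadata, alongside the plain skos label kept for readability. It is not taken from the paper: the function name describe_term, the identifiers passed to it, and the choice of dcterms:created for the creation date are assumptions made for illustration only.

```python
# Illustrative sketch only; identifiers and the date property are hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
DCT = "http://purl.org/dc/terms/"
AGROVOC = "http://aims.fao.org/aos/agrovoc/"


def describe_term(concept_id, label_id, text, lang, created):
    """Return (subject, predicate, object) triples for one preferred term.

    The skos-xl label is a resource of its own, so editorial information
    (here a creation date) can be attached to it; a plain skos label with
    the same literal is attached directly to the concept.
    """
    concept = AGROVOC + concept_id
    label = AGROVOC + label_id
    return [
        (concept, SKOSXL + "prefLabel", label),         # concept -> label resource
        (label, SKOSXL + "literalForm", (text, lang)),  # label resource -> literal
        (label, DCT + "created", created),              # editorial info on the label
        (concept, SKOS + "prefLabel", (text, lang)),    # plain skos label
    ]


# Hypothetical identifiers and date, used only to show the call.
triples = describe_term("c_0000", "xl_en_0000", "maize", "en", "2011-01-01")
```

The point of the design is that a plain skos label is a literal and cannot be the subject of further statements, whereas the skos-xl label resource can.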

In Sec. 3, Table 1 has been updated, and an example of an exact match is provided.

Whoops, sorry: a few points that were actually addressed in our last submitted version (2.4), but for which we did not provide a reply here, are:

1. Versioning support. It is still unclear whether versioning support for AGROVOC is actually planned, or it is something that is not going to be provided in the near future.

In section 4 (second-to-last paragraph) we have added a description of the level of support for versions which is provided in the dataset.

2. References. References [5] and [10] are still incomplete (missing title).

fixed

3. I would like to thank the authors for the feedback they provided on the implementation issues. However, I would suggest the authors briefly acknowledge that some work is underway to fix some issues in the implementation of AGROVOC according to Linked Data principles. Such a statement might be included at the end of Section 4.

Honestly, we would avoid mentioning very specific technical issues which are still being fixed (e.g. a wrong MIME type for one of the formats, to give just one example). As said, some of them depend on our current hosting and may be fixed very soon, possibly before the issue containing this article is published. For the same reasons, other bugs (like out-of-sync info) could manifest in the following months, so reporting them in a journal may be more confusing than useful. However, at the editor's discretion, we can provide this detailed report if required.