GloSIS: The Global Soil Information System Web Ontology

Tracking #: 3325-4539

Authors: 
Raúl Palma
Bogusz Janiak
Luís M. de Sousa
Kathi Schleidt
Tomáš Rezník
Fenny van Egmond
Johan Leenaars
Dimitrios Moshou
Peter Wilson
David Medyckyj-Scott
Alistair Ritchie
Yusuf Yigini
Ronald Vargas

Responsible editor: 
Boyan Brodaric

Submission type: 
Full Paper
Abstract: 
Established in 2012 by members of the Food and Agriculture Organisation (FAO), the Global Soil Partnership (GSP) is a global network of stakeholders promoting sound land and soil management practices towards a sustainable world food system. However, soil survey largely remains a local or regional activity, bound to heterogeneous methods and conventions. Recognising the relevance of global and trans-national policies towards sustainable land management practices, the GSP elected data harmonisation and exchange as one of its key lines of action. Building upon international standards and previous work towards a global soil data ontology, an improved domain model was eventually developed within the GSP, the basis for a Global Soil Information System (GloSIS). This work also identified the semantic web as a possible avenue to operationalise the domain model. This article presents the GloSIS web ontology, an implementation of the GloSIS domain model with the Web Ontology Language (OWL). Thoroughly employing a host of semantic web standards (SOSA, SKOS, GeoSPARQL, QUDT), GloSIS lays out not only a soil data semantic ontology but also an extensive set of ready-to-use code-lists for soil description and physio-chemical analysis. Various examples are provided on the provision and use of GloSIS-compliant linked data, showcasing the contribution of this ontology to the discovery, exploration, integration and access of soil data.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Steve Richard submitted on 20/Feb/2023
Suggestion:
Minor Revision
Review Comment:

(1) originality - The paper describes implementation and applications of an OWL ontology for describing soil observation data; it is built on several previous efforts but is a new implementation that attempts to harmonize existing models.
(2) significance of the results -- Soil management is an important global sustainability issue and this work documents an important tool to support the necessary information sharing.
(3) quality of writing -- generally pretty good, but there are a surprising number of simple typos, and some convoluted sentences that need to be clarified.
The long-term stable resource link returns a Turtle file -- raw content from a GitHub repo. Based on the text I think it’s supposed to return an HTML representation generated from the Turtle, but that's not working. The documentation of the entities in the Turtle file is sparse to non-existent, so without other supporting resources it would be hard to figure out the intention of the classes and properties. I have loaded the Turtle files into Protégé and the GitHub repository content seems to be complete, but that is not what the 'long-term stable link' returns directly. It might be better for the link to point to https://github.com/rapw3k/glosis, although that is a personal repository belonging to Raúl Palma and I would expect the ontology stewardship to be moved to some institution.

Recommendations:
--Provide access to the UML GLOSIS model underlying this implementation. I expected to get a more complete overview of the actual data model before diving into the OWL implementation.
--A glossary defining terms, especially ‘FeatureOfInterest’, ‘SamplingFeature’, ‘Sample’, ‘Spatial data type’, and ‘Spatial object’, would be very helpful. These terms do not seem to be used consistently in the text, leading me to suspect that different parts of the manuscript were written by different team members with somewhat different perspectives.
--There are a variety of typographical and grammatical errors, as well as hard to interpret sentences in the text; these should be edited. Many of these are highlighted, and in some cases I have suggested some alternate wordsmithing to take or leave as comments in the pdf…
--Add a table mapping namespace abbreviations to URIs and an explanation of what’s in the namespace near the beginning of the text, with consistent usage of namespace abbreviations throughout the text and Listings.
--Links to the actual ontology Turtle files and GitHub repo should be provided in the introductory text. The Turtle files are only useful to readers with tools to inspect them (Protégé, TopBraid), but it was useful to me to be able to look at those while I read, after I found the URIs in section 4. The HTML documentation pages are also useful.
--Improve the documentation for the semantics of the classes, properties and codelist items. Currently the documentation is very sparse both in the text and in the WIDOCO html documentation. It would be quite challenging for someone new to the program to figure out what they are about.
--Increase the size of the figures to make them legible
--Format turtle listings for better readability.

Comments:
Given the opportunity I would love to discuss distinctions between domain features and sampling features with the GLOSIS modeling team. Particularly in the case of Profile, which seems to be used alternately as a domain feature (like ‘stratigraphic unit’ a package of identifiable layers in the Earth that have some extent, proxy for ‘pedon’ in the soil world?) or as a sampling feature (like ‘Section’, a linear feature in or on the Earth where one or more stratigraphic units are described and sampled).

SKOS semantic relation properties are implemented as annotation, but in the SKOS OWL source, they are object properties. This should be noted in the documentation.

The document describing the underlying conceptual (abstract) model (Reznik, T. and Schleidt, K. (2020). Data Model Development for the Global Soil Information System (GloSIS). Technical report, GSP - Global Soil Partnership.) does not appear to be available. This document is cited for a description of the model that is the basis for the OWL implementation, as well as for listing of requirements that the model is intended to meet. The text outlines implementation requirements, but there is no summary of the domain requirements the ontology is intended to serve. Without some overview of the conceptual model that is being implemented, this paper is about OWL implementation techniques, and doesn’t help to understand what the OWL is intended to represent.

The review of existing models (section 2) is an interesting historical perspective, but isn’t essential to the description of the GLOSIS implementation. If the text needs to be shorter, I’d abridge these sections, focusing on elements from those models that have been used for the GLOSIS implementation.
Section 3 on methodology also seems low priority and could be removed.

The model in figure 3 is too small for a print publication. What do the colors assigned to the boxes (classes) in the diagram mean? What is a spatial class, as opposed to FeatureOfInterest, Spatial data type, or spatial Object, terms used in other parts of the text apparently talking about the same entities? From a modeling point of view, what is shown here (after enlarging to 800% so I can read it) is that Site and Plot have exactly the same properties and relations, except that a site can be an aggregation of plots, and plots do not have position or extent. I suspect that Plots should have position at least, if not position and extent, especially since they have positionalAccuracy. The model would be simpler if there was one Site class, with Plot as a subclass, and an aggregation relation hasPlot for Plots that are within sites. All the green property boxes (xxxxInfo) would then be associated with Site. Is this diagram from the GSP GLOSIS UML conceptual model?

Figure 4 is also too small to be useful. Also, the classes are laid out so that the links are tangled. See the revision of this figure with some comments at the end of this file.

Section 4.2, 4.2.1, p10. The details about using ShapeChange are only interesting to the small population of readers who might be using ShapeChange (I’m in that group), but for a wider audience, I think this could be greatly abbreviated. This information will become dated in the not too distant future. Better to put this information in the GLOSIS repo GitHub wiki.

The patterns described in section 4.2.2 probably have a wider utility for readers working on OWL implementations. The implementation of codelists (e.g. Listing 9) is novel to me, but looks like an approach I’ll try in the future.

The layout of the Turtle text in the Listings is hard to read, with statement terminations in random places on the lines. The two-column format does not lend itself to readability; can these be put in as two-column figures so the formatting is more readable? I’ve put reformatted listings in a couple of notes in the pdf that I used to understand the serialization better.

It isn’t until section 4.3.2 that you get a clue (the base URI) for how to actually see the Turtle files for the implementation. It would be kind to readers to provide this information earlier—maybe even in the abstract, or at least in the first paragraphs of section 4.3.

I was disappointed to see the use of hash (#) URIs for codelist concepts, model classes, and properties. See Rule 5 in Cox et al., 2021 (https://doi.org/10.1371/journal.pcbi.1009041):
It is recommended to use slash (‘/’) IRIs for large vocabularies, rather than hash (‘#’) IRIs. When a # IRI is requested the entire vocabulary will be returned instead of just a single term. This may be acceptable for a small vocabulary, but is undesirable for large vocabularies [Berrueta and Phipps, 2008, https://www.w3.org/TR/swbp-vocab-pub/].
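
The practical effect of this rule can be illustrated with a minimal sketch using only the Python standard library. An HTTP client drops the fragment before sending a request, so every hash IRI in a module resolves to the same document. The hash IRI is the one tested later in this review; the slash-style IRI is hypothetical and shown only for contrast.

```python
from urllib.parse import urldefrag


def document_url(iri: str) -> str:
    """Return the URL an HTTP client actually requests for an IRI.

    The fragment (the part after '#') is never sent to the server.
    """
    url, _fragment = urldefrag(iri)
    return url


hash_iri = "http://w3id.org/glosis/model/layerhorizon#GL_Horizon"
# Hypothetical slash-style equivalent, for illustration only.
slash_iri = "http://w3id.org/glosis/model/layerhorizon/GL_Horizon"

# The hash IRI collapses to the module document; the slash IRI stays term-specific.
print(document_url(hash_iri))
print(document_url(slash_iri))
```

With hash IRIs the server only ever sees the module URL, which is why the whole vocabulary comes back for a single term.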

On documentation for the model entities and properties—the text states : "This HTML documentation is also accessible through the W3ID dereferencing mechanism. Making use of content negotiation mappings, the user is presented with the HTML documentation when accessing GloSIS resources directly with a web browser."
I tried this and it didn’t work for me. E.g. when I GET "http://w3id.org/glosis/model/layerhorizon#GL_Horizon" using my Chrome browser, I get the GitHub raw view of the whole layerhorizon module in Turtle serialization, not the WIDOCO HTML.
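
A check of this kind could be scripted roughly as follows. This is only a sketch with the Python standard library: it builds the two requests a content-negotiating server would have to tell apart, without sending them. The helper name `negotiated_request` is hypothetical, and what the W3ID redirect actually returns depends on server rules not shown in the manuscript.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def negotiated_request(iri: str, media_type: str) -> Request:
    """Build (but do not send) a GET request asking for one representation."""
    url, _fragment = urldefrag(iri)  # the fragment is never sent to the server
    return Request(url, headers={"Accept": media_type})


iri = "http://w3id.org/glosis/model/layerhorizon#GL_Horizon"
html_request = negotiated_request(iri, "text/html")
turtle_request = negotiated_request(iri, "text/turtle")

for request in (html_request, turtle_request):
    print(request.get_method(), request.full_url, request.get_header("Accept"))

# To run the check against the live server (network access required):
#   from urllib.request import urlopen
#   with urlopen(html_request) as response:
#       print(response.headers.get("Content-Type"))
```

If the content negotiation described in the text were in place, the first request would be expected to yield HTML and the second Turtle.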

I found the GitHub repo at https://github.com/rapw3k/glosis, and followed the documentation link https://rapw3k.github.io/glosis/docs/ to the docs pages. When I look at one of the documentation pages, e.g. https://rapw3k.github.io/glosis/docs/glosis-pr-doc/index-en.html for the Profile module, the abstract is all boilerplate, as is all of the introduction until the last couple of sentences, which finally tell me something about what’s in the module. I expected to see that information in the Abstract. None of the classes, properties, or individuals have an explanation of what they represent. The same is true for most of the other modules (there is some description in Procedures and the imported classes in the 28258 module).

p. 11 container classes. The description keeps mentioning ‘container classes’ in the UML model, but I haven’t found anything in the text to explain what these are, I’m left to guess. Some example UML would be so helpful.

Section 5.2—again, this is a pretty detailed description of using tools to convert RDF to CSV and vice-versa — probably better included in a GitHub wiki. Within a couple of years this will all be superseded by some new process.

I think the example queries in section 6 will be useful to users trying to figure out how to write SPARQL queries, but how durable is this? The ontology will likely be updated and RDF query tools will evolve. Will this have any utility in a couple of years (beyond historic interest)? These kinds of examples are extremely useful, but again might be better included in the GitHub wiki at https://github.com/rapw3k/glosis. What is important here (IMO) is that various large soil data systems have been able to map their content into a consistent interchange format using the GLOSIS OWL implementation, enabling frictionless integration of data from different sources, as in the example in section 6.4. The previous section 6 examples could be summarized by showing GLOSIS representations of the same kind of data (e.g. Horizon layer (or horizon) properties), with provenance, from the different sources with representations that are easily merged.

Review #2
Anonymous submitted on 21/Feb/2023
Suggestion:
Accept
Review Comment:

The paper presents the Global Soil Information System (GloSIS) web ontology for representing soil data and knowledge, describes the development process of this ontology and provides examples of applications of the ontology.

The GloSIS ontology is proposed as a new framework for the exchange of soil data and knowledge. Following best practices of ontology development, a strong emphasis has been placed on the reuse of existing (non-)ontological resources and standard data models. A (very) detailed description of the ontology development process is given, which is a mix of semi-automatic model transformations and manual operations. A great quality of this article is that it details the different aspects and associated challenges of ontology development (methodology, resource reuse, user-centred design choices, modularization, Linked Data principles, documentation...). This makes the article a good overview of what it means to develop a domain ontology, although some sections go into a bit too much detail for my taste, e.g. section 4.2.1 on ShapeChange's configuration.

Another strength of this article is that it mentions different aspects of making ontologies (and ontology development) more accessible to users that are not familiar with semantic web technologies, including the generation of human-friendly documentation, conversion of OWL ontologies to and from CSV files -a format with which domain specialists are more familiar- "wrapping" SPARQL queries into a REST API, etc. The authors mention a number of tools to address these tasks, which may be of great interest to ontology developers. Finally, the paper provides a good illustration of the different applications of such an ontology for data integration, access and discovery.

I don't doubt that the results are significant, given that harmonising data is a hot topic in many areas of ecology. One could criticise that the paper proposes "yet another" standard data model. However, this ontology is built on good foundations, following a solid methodology. The authors have demonstrated their knowledge of good practices in ontology development, and of the challenges associated with getting it adopted by the soil science community and data providers. Having this point addressed at the end of the paper was something I really appreciated.

The paper is very well written, if a little long, but I suppose that is the downside of precision.

Typos:
- p. 2, l. 29: "This lack of heterogeneity..." -> I guess the authors mean "This lack of homogeneity..." ? Otherwise, consider rephrasing.
- p. 21, l. 38 : "Doublin Core" -> "Dublin Core"

Review #3
By Catherine Roussey submitted on 12/Mar/2023
Suggestion:
Major Revision
Review Comment:

This 25-page paper presents the Glosis ontology, used to harmonize the description of soil information and measurements.
This paper presents a huge amount of work on ontology design for data harmonization in the soil domain. Unfortunately the ontology does not seem to be finished. The documentation should be improved.
I have a lot of questions, and some clarifications, examples and diagrams should be added.

I do not understand why some classes extracted from the ISO standard are not replaced by equivalent classes in the Glosis ontology. ISO standards are not web ontologies. ISO classes are used in the dataset descriptions, which suggests that the Glosis ontology is not complete.
I am not sure I have understood the authors’ design choices.
I am not sure that the SOSA design pattern is used well in the Glosis ontology. Why are only samples described in the Glosis ontology?
Moreover, the authors’ vocabulary should be normalised. It is quite confusing to understand which type of data model is involved.
--------------------------------------------

The introduction presents the motivation, the international project and its stakeholders e.g. FAO.
The goal of the project is to build a common data model to exchange soil data between stakeholders. The soil data should also be harmonized. A global system should be able to query the harmonized datasets to help decision makers.
The specification of the ontology is to enable the harmonization of local, regional or national datasets about soil content description and measurements.
---
The state of the art presents several international standards for soil data exchange: ANZSoilML, INSPIRE, ISO 28258, OGC Soil IE, WoSIS, SOTER.
Some parts of them, such as controlled vocabularies or code lists, have been sources for the Glosis ontology.
Q1: The ISO 28258 standard was identified as a good input for the Glosis ontology. This choice is quite surprising. The authors claim in the state of the art that this standard was never used because it seems too abstract. Maybe the authors could justify their choice.

The common elements shared between different soil standards have been identified to be part of the Glosis ontology. Unfortunately the common elements are not documented with textual descriptions in the Glosis ontology. For example, the skos:note annotation property could be used to store the definitions coming from the different standards, and a skos:definition property could store the harmonized definition written by the Glosis project, based on the definitions in the previous standards.

Q2: I would appreciate it if the authors could clearly identify the type of data model for each standard: pure XML data schema, relational database schema, object-oriented schema, graph-based schema or ontology (web data schema based on Semantic Web technologies). UML can be used to specify different types of data schema, thus it does not help. Maybe the authors could indicate whether the UML data schema is implemented in a storage system or not. What would a UML data schema without a real implementation in a storage system be called?
For example, (page 6 column 2 line 40) SoilIE is presented as an ontology. (page 7 column 1 line 11) SoilIE is presented as an XML schema that is not totally compliant with Semantic Web technologies.
The paper uses several closely related expressions like domain model, data model, data semantic ontology, ontology, web ontology. Some standards are XML data models or UML data models, others are web ontologies. Thus the authors should normalize their vocabulary to make a clear distinction between UML data model, OWL ontology and SKOS model.
---
The methodology used for the ontology design is the NeOn one. Several methods from NeOn were used. As far as I understand the paper, I have only recognised:
* scenario 2, where existing data models have been transformed into an OWL ontology to be the basic element of the Glosis ontology.
* scenario 3, which reuses existing ontologies. The Glosis ontology reuses well-known ontologies and vocabularies like SOSA, GeoSPARQL, QUDT and SKOS.

Q3: Unfortunately, I do not see where scenario 7 is used in the ontology development process, that is to say, when ontology design patterns are used.

The ontology is modular and belongs to a network. The ontology is developed in an iterative and incremental way.
(Page 3 column 2 line 36) “Glosis domain model and web ontology” means that there exist two models: an abstract one and its implementation as an owl ontology.
(Page 9 column 1 line 6) “GloSIS domain model was used as the base from which to derive the ontology.”

Q4: The authors should define what the domain model is. I understood only on page 9 that a UML diagram was designed first and then transformed into an OWL ontology.

I would appreciate a schema that presents the whole design process with UML diagrams, existing ontologies, and data model transformation.
Moreover I am not sure about the division into modules. The profile class and its components, the layer horizon classes, are separated into two distinct modules.
If the domain model is a kind of UML diagram, the documentation in the git repository should contain those diagrams.

---
The name of section 4 should be replaced. This section is not only about specification.
The requirements are listed precisely. All the elements from previous standards that need to be reused are identified.
Note that the SOTER data model is not presented in the article even though it is one of the reused standards.

In section 4, the words “data model” and “ontology” are both used, which confuses me further.

Q5: (Page 8 column 2 line 12) “Codelists/federation of vocabularies/registries (ontologies) shall be developed for linking the data model with explicit soil body properties.”
I do not understand that point. Maybe the authors should include “SKOS models” or “thesaurus” and differentiate them from web ontology. To help understanding, the authors should replace the word “property” by “characteristic”.

Q6: (Page 8 column 2 line 12) “Include vocabularies/registries (ontologies), but in an abstract form. This means that vocabularies may be added/modified/deleted without changing the domain model itself.” I do not understand: what is an abstract form of an ontology?
(Page 3 column 1 line 44) “abstract ontology”. Could you explain what is an abstract ontology?

Q7: I would appreciate it if the authors defined what an “observed property” is in section 4.1. The words “concept”, “attribute” and “observation” are associated with observed property. The sentence is also confusing (Page 8 column 2 line 31). The next sentence (line 36) seems redundant.

Q8: Figure 3 is illegible. Thus I do not understand what is a container class that links a spatial class with an abstract class.

It seems that the ontology designer wants to make a choice based on the various standards:
Each soil feature has some characteristics (for example the size of a plot) that could be represented as a datatype property (one RDF triple) or as a sosa:Observation graph (a set of RDF triples), because the value of the characteristic evolves over time or can be evaluated by several methods.

A CSV is used to summarize all the characteristics of soil features represented in the various standards:

Q9: Each line represents a characteristic and each column represents a feature. The csv helps to make the decision. But what is the decision? It is not clear to me.

As far as I understand the paper, a decision is reached for each characteristic on its representation: a datatype property or a sosa:Observation graph. Both representations cannot belong to the final Glosis ontology. Or maybe the authors decided to take only the second representation, the sosa:Observation graph. Section 4.2 is not clear to me. I need an example.

The ontology designer can also take another solution: keep the sosa Observation graph and add an object property that repeats and shortens the path between the feature and the value of the characteristic.
I would appreciate it if the authors changed their vocabulary. In the end I do not understand the word property: RDF object property or datatype property, SOSA observable property, soil property. The same word represents different notions.
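
The shortcut alternative mentioned above can be sketched in a few lines of Python, with plain tuples standing in for RDF triples. Everything under `http://example.org/` is hypothetical, and reusing the observable property IRI as the shortcut predicate is an assumption made only for this illustration, not something the manuscript does. The SOSA terms (sosa:hasFeatureOfInterest, sosa:observedProperty, sosa:hasSimpleResult) are the standard ones.

```python
SOSA = "http://www.w3.org/ns/sosa/"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def shortcut_triples(triples):
    """Derive one direct (feature, property, value) triple per complete observation."""
    observations = {}
    for subject, predicate, obj in triples:
        observations.setdefault(subject, {})[predicate] = obj

    shortcuts = []
    for properties in observations.values():
        try:
            shortcuts.append((
                properties[SOSA + "hasFeatureOfInterest"],
                properties[SOSA + "observedProperty"],  # assumption: reused as predicate
                properties[SOSA + "hasSimpleResult"],
            ))
        except KeyError:
            continue  # skip nodes that are not complete observations
    return sorted(shortcuts)


observation_graph = [
    (EX + "obs1", SOSA + "hasFeatureOfInterest", EX + "plot1"),
    (EX + "obs1", SOSA + "observedProperty", EX + "plotSize"),
    (EX + "obs1", SOSA + "hasSimpleResult", "25"),
    # An incomplete observation (no result) yields no shortcut.
    (EX + "obs2", SOSA + "hasFeatureOfInterest", EX + "plot1"),
    (EX + "obs2", SOSA + "observedProperty", EX + "plotSize"),
]

print(shortcut_triples(observation_graph))
```

The observation graph stays the authoritative record; the derived triple only shortens the path from feature to value for simple queries.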

The ShapeChange tool is used to transform the UML diagrams into an OWL ontology. The UML diagram contains classes from other standards that are not web ontologies. Thus during the transformation some mapping rules are expressed to indicate which class from an existing ontology will replace the UML classes.
From my understanding, the UML diagram designed at first is a relational database schema where lots of attributes and cardinality constraints are represented. Note that I cannot read the UML diagram, thus I have to imagine. Those UML classes are aligned with OWL classes from existing ontologies.
Most classes are designed with restrictions.

Q10: I am surprised that, for the purpose of integrating different datasets, so many restrictions and constraints are defined. What happens to the original datasets when they cannot fulfill those constraints? They cannot be part of the final knowledge graph.
Moreover some of the constraints list the possible values, defined as skos:Concept instances.
Could the authors explain a little these choices of creating so many constraints, and how the original data are handled when those constraints cannot be fulfilled?

Unfortunately the classes and instances are not well documented with a natural language definition. In the best case, there is just a label in English, which is not enough to understand the meaning of the OWL/RDF element. Some elements have a definition pointing to an HTML page of an ISO standard which is not freely available, thus the link is not useful.

The authors should associate a textual description with each element using the rdfs:comment annotation property or the skos:definition property. Moreover those natural language definitions should explain the constraints associated with the class, for example by mentioning mandatory object properties and so on. The authors should take inspiration from the definitions provided in the SOSA ontology.
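
A documentation check along these lines could be automated. The sketch below, again with plain tuples standing in for RDF triples, lists every declared term that carries neither a skos:definition nor an rdfs:comment. The terms under `http://example.org/` are hypothetical and do not come from the Glosis modules.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
SKOS_DEFINITION = "http://www.w3.org/2004/02/skos/core#definition"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def undocumented_terms(triples):
    """Return the typed subjects that have no textual definition or comment."""
    declared = {s for s, p, _ in triples if p == RDF_TYPE}
    documented = {s for s, p, _ in triples if p in (SKOS_DEFINITION, RDFS_COMMENT)}
    return sorted(declared - documented)


sample = [
    (EX + "Horizon", RDF_TYPE, OWL_CLASS),
    (EX + "Horizon", SKOS_DEFINITION, "A soil layer formed by soil-forming processes."),
    (EX + "Layer", RDF_TYPE, OWL_CLASS),  # label-only terms would be reported
    (EX + "fragmentCover", RDF_TYPE, SKOS_CONCEPT),
    (EX + "fragmentCover", RDFS_COMMENT, "Share of the surface covered by fragments."),
]

print(undocumented_terms(sample))
```

Run against the real modules with an RDF library, a report like this would give the soil experts a concrete list of terms still needing definitions.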

Moreover none of the design patterns are described in detail. That makes the ontology quite difficult to understand. More diagrams should be provided. If the authors like UML they could have a look at the CHOWLK language to present some of their design patterns.
I have tried searching in the VOWL interface, but when no textual definition is provided it is impossible to understand the graph.

Q11: I do not understand the distinction between the Fragments and Fragments Value classes.
Fragments should be renamed FragmentObservations. Where is the sample defined, i.e. the soil fragment on which the chemical analyses are made? A fragment is composed of several horizons or layers.
Figure 4 suggests that all classes related to soil features are subclasses of sosa:Sample. Thus why not name them FragmentSample and so on? This hypothesis is a strong one. How is the fact represented that a physical layer (or a physical horizon or a plot) can generate several samples? The SOSA ontology makes the distinction between the ultimate feature of interest (the whole soil area) and the sample extracted from the feature of interest.

Take care with the names of your classes: what does GL mean in GL_Horizon and GL_Layer?
Q12: In the end, I am not sure that the SOSA design pattern is well understood and reused in this ontology. For example the class “gypsum weight” is defined as a subClass of sosa:Observation and for me it is more a subClass of sosa:ObservableProperty.

Q13: (Page 13 column 1 line 47) “ There are few cases where sosa:observedProperty links the observation with a code-list.” This sentence is unclear to me.
The SOSA ontology does not contain an observedProperty class, only an ObservableProperty class.
What is the difference between the authors’ observed property and a sosa:ObservableProperty?

It seems that most of this work is to transform code lists into appropriate SOSA classes (feature of interest or observable property or result or procedure) reusing some part of SKOS model.
This work is interesting but not well presented and hard to understand. A simple example illustrated by several diagrams should be added.

Q14: Is the SKOS hierarchy of sosa:Procedure linked to or influenced by the SKOS hierarchy of sosa:ObservableProperty?

Finally, the authors should take care to normalize their vocabulary.
The Glosis ontology contains lots of defined classes and skos:Concept instances. In the end, I am a little bit confused: what is the ontology (the web data schema), what are the SKOS models (controlled vocabulary lists) and what are the knowledge graphs (the datasets that populate the ontology and reuse the SKOS models)?

All the qualitative values defined in code list and used as result of sosa:Observation are defined as instances of two classes: skos:Concept and a new domain class.

The ISO 28258 data model is also translated into an OWL ontology in order to provide some alignment between Glosis ontology classes and ISO 28258 classes.
The Glosis ontology is available as TTL files from a git repository. The URIs are defined as permanent URIs.
The documentation is produced by the WIDOCO tool, but the resulting HTML pages are quite poor: lack of definitions, problems in the labels of classes, no static diagrams available.
----
The maintenance and evolution of the Glosis ontology is based on exchanges between soil experts and OWL experts. To do so, some CSV files are produced to help soil experts propose modifications and review the ontology.
The transformation (CSV ↔ OWL) is performed by a new tool developed during this project.
If CSV is used it will be easy to ask the soil experts to propose their own definitions of Glosis elements!
---
Section 6 presents some knowledge graphs of soil dataset using Glosis ontology and related SPARQL queries.
The datasets are: the LUCAS topsoil dataset, the global soil respiration database, and the World Soil Information Service dataset (WoSIS). The WoSIS dataset was not precise enough, thus the authors explain their translation choices.

Q15: Why are iso 28258 classes used in those datasets? I would expect that Glosis ontology is complete and sufficient to represent all the soil information?

All the datasets are available and published on the web using a SPARQL endpoint.
Some SPARQL queries are presented related to different datasets and a federated query is proposed at the end to query all datasets.
Some REST API services were also developed, based on the previous SPARQL queries, to access the results.
---
At the end, some future works are proposed. SOSA ontology does not describe the uncertainty of the measurement, that is one of the possible extensions.

Misspellings --------------------------
“semantic web” should be replaced by “Semantic Web”.
“Doublin core” should be replaced by “Dublin Core”.
Glosis should be written the same way in the whole document (GLOSIS, GloSIS...).

page 2 column 2 line 30: “lack of heterogeneity” are you sure… I would expect lack of homogeneity. If it is not an error could you explain in detail why lack of heterogeneity is a problem.

Page 5 column 2 line 43 “INSPIRE ...most used ontology” INSPIRE data model is not an ontology, is it?

Page 6 column 2 line 27 “bur” I do not know this word.

Page 11 column 1 line 2 “xx”: what does XX mean? I imagine that in the “PP_name” expression, PP is the prefix that identifies the standard. Thus what are TM, CQ and DQ?

Page 13 column 2 line 9: “sublass” → subclass?

Page 14 column 1 line 38 “SKOS ontology”… I would prefer that the authors make a distinction between a thesaurus represented as a SKOS model and a data schema (ontology).

Page 19 column 2 line 23: footnote problem

page 20 column 2 line 16: “sport” → support?

Page 22 column 2 line 46: “ the the” remove one

page 25 column 1 line 30 “swathe” do not know this word.