Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF

Tracking #: 635-1845

Authors: 
Steve Baskauf
Campbell O. Webb

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Ontology Description
Abstract: 
Darwin-SW (DSW) is an RDF vocabulary designed to complement the Biodiversity Information Standards (TDWG) Darwin Core Standard. DSW is based on a model derived from a community consensus about the relationships among the main Darwin Core classes. DSW creates two new classes to accommodate important aspects of its model that are not currently part of Darwin Core: a class of Individual Organisms and a class of Tokens, which are forms of evidence. DSW uses Web Ontology Language (OWL) to make assertions about the classes in its model and to define object properties that are used to link instances of those classes. A goal in the creation of DSW was to facilitate consistent markup of biodiversity data so that RDF graphs created by different providers could be easily merged. Accordingly, DSW provides a mechanism for testing whether its terms are being used in a manner consistent with its model. Two transitive object properties enable the creation of simple SPARQL queries that can be used to discover new information about linked resources whose metadata are generated by different providers. The Individual Organism class enables semantic linking of biodiversity resources to vocabularies outside of TDWG that deal with observations and ecological phenomena.
Tags: 
Reviewed

Decision/Status: 
Minor revision

Solicited Reviews:
Review #1
Anonymous submitted on 06/Oct/2014
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

This is a sound paper, based on a sound process (socially, technically) of representing the essential Darwin Core standard for biodiversity data in RDF. DwC in essence provides the ontology, and the 428 million records using it should suffice to attest to its quality and relevance. This is at the very heart of improving biodiversity informatics via ontological representation. I congratulate the authors on their effort and have only minor comments, virtually all of which are contained in comments (sticky notes) made in the submitted PDF.

Overall, the manuscript is very well written (if not exactly showy), and the content, tables and figures flow well and are easy to understand and pertinent. I had 1-2 minor issues with the actual RDF model; most critically the suggested need to distinguish TaxonConceptLabel (or a similar string to denote the name sec. author combination - just another string, really) from TaxonConcept (which is really more like a theory about the circumscription/boundaries of a taxon - a theory whose validity we keep on testing because the nature of taxonomic boundaries is revealed to us gradually [if ever]). "Taxon" likely needs no representation at all in DwC or DSW (or elsewhere in our virtual representations). It is too simple to be very useful.

More thorough framing and referencing would possibly increase the visibility/impact, and a bit more force at the end outlining the need to not fall behind on this RDF translation as a new standard, might also be appropriate. The core of the manuscript is not affected by this.

Review #2
By Mark Schildhauer submitted on 12/Oct/2014
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

1) Quality and relevance:

This paper describes Darwin-SW, which is an RDF/OWL ontology that formally specifies the relationships among several key classes of concepts from the widely-deployed Darwin Core standard for describing biodiversity occurrences-- that constitute evidence of some living organism(s)-- typically expertly identified to a species level-- as having "occurred" at some place at some time. The paper reads well and is highly relevant to the audience of this journal.

The authors modestly describe Darwin-SW as an RDF vocabulary that "complements" the Darwin Core standard for describing biodiversity specimens and occurrences, but the implications of Darwin-SW are potentially far broader. Not only does this effort semantically clarify the relationships among Darwin Core terms, it does so in a framework (RDF/OWL) amenable to broader exposure and linking of those types of data via LOD and the Semantic Web technologies, as opposed to the "flat-file" formats that are currently most prevalent for data exchange by this research community. This paper represents a significant step forward in advancing semantic modeling of key notions within the biodiversity sciences, and serves to introduce knowledge modelers and semantic web experts to some of the major concepts and interoperability challenges faced by the biodiversity research community. In the paper, the authors expand on the terse and ambiguous definitions provided for terms in the Darwin Core standard, by creating classes, instances, and object properties to link resources in formal RDF and OWL syntax. Their model has strong merits, and hopefully provides a good basis for expansion in other on-going efforts to develop more semantically competent descriptions of biodiversity data and the processes involved in acquiring and curating these types of data and specimens.

2) Illustration, clarity, readability:

While the paper is clear and highly readable, there are aspects that I hope the authors will consider for improving potential impact of their paper. First, many readers of the "Semantic Web Journal" may not be aware of what are "biodiversity data", so I strongly recommend that the authors provide a sentence or two of clarification in their introductory paragraph about this, so that the article is more self-contained. Planned introductory materials for this special issue on "biodiversity semantics" will clarify the scope of what is meant by "biodiversity", but we expect all authors to provide sufficient context about their topical area so that articles can be read independently as well.

I am concerned about some of the authors' choices of "labels" for their new concepts; and the lack of natural language definitions to further clarify these concepts within their OWL/RDF "dsw.rdf" vocabulary. Specifically:

Please define many of your terms using at least some clarifying natural language as well as richer semantic expressions-- in the RDF vocab itself. Candidate terms crying out for definition include: Occurrence, IndividualOrganism, Organism, Specimen, Token, Event.

* while recognizing potentially problematic edge cases, saying that IndividualOrganism is "not restricted to a single biological organism: it can be any sort of organism, clone, colony, or group of organisms that is typically observed or sampled over time" strikes me as highly problematic.

The notion of "Organism" is typically defined, in both informal (dictionaries) and formal (e.g. OBI) vocabularies, as strongly signifying "individual biological entities"; so coining a concept such as "IndividualOrganism" is both redundant --Organism is circumscribed according to some notion of individuality; and contradictory, by asserting that the concept is "not restricted to a single biological organism" (this latter being exactly what one would specifically expect IndividualOrganism refers to). I'd suggest other modeling approaches that retain requirements of "individuality" for organism, where "individuality" denotes extremely high interdependence (e.g. taxonomic homogeneity requirement seems reasonable) AND spatial contiguity of biological parts, such as with the tissues or cells that compose an individual body.

Thus, while the parts/components of an IndividualOrganism can be highly integrated and specialized as in a human body, or less so in a sponge colony, the concept should NOT pertain to a spatially distributed population of entities even of some uniform taxon, or the aggregate of "individual entities" comprising a species-- both of which concepts might qualify as "IndividualOrganisms" under the current definition. For example, I think it misleading to label all of some endemic desert pupfish individuals living in a single freshwater pool as an "IndividualOrganism". Rather, suggest defining a notion of "Organism" (dropping the "Individual" prefix since it is already implied), accompanied by a natural language description, and provide several clarifying examples.

* using the word Token to indicate what are more commonly called "Samples" or "Specimens" or other forms of "Evidence" is problematic.

The term "Token" is not nearly as commonly used to indicate various aspects of evidence for biodiversity occurrences as "samples" or "specimens" or other "material objects" or other information artifacts. And in technology, Token is commonly used in the context of access and authentication, or simply as an identifier. The need to describe types of evidence for "occurrences" or "observations of presence of some material entities" is pervasive in natural science, so the Biodiversity community might consider using well-defined concepts from ontologies being developed and used by allied disciplines, e.g. from OBI, IAO, BCO, or SIO. E.g. as in OBI, defining an "Image" (of some taxon) as being an "Information Content Entity" that is a "Generically Dependent Continuant" of some "Organism" that is a "Material Entity"-- would provide greater interoperability of annotations as the biodiversity/ecological community converges with the omics community in its explication of biodiversity phenomena at geospatial to genomic levels of inquiry.

* usage of the terms "normalize/denormalize" is unclear and/or variable in the paper. Normalization in the context of RDF often refers to having canonical output graph formats, but that seems not to be relevant here, as the various proposed representations provide different semantics, depending however on the definitions of the concepts involved. And normalization in the context of ER modeling (as in Fig. 1) refers to minimization of redundancy in data representation, and while a URI might be perceived as an ER key analogue, with an Open World Assumption, there is no automatic propagation of correction of a URI to all triples in which an earlier URI might have been asserted. I encourage the authors to find simpler, more clarifying terms whenever they refer to "de/normalization", especially in the Introduction, and sec. 3.1. In sec 3.1 it would appear that "simpler" or "alternative" models are the case, rather than more "normalized/denormalized" ones. Sec. 3.1 also points out the critical need to clarify, by natural language definitions as well as richer semantic structuring, what is an "Occurrence" and what is an "Event". While their argument about their proposed model is sound, it seems that, e.g. those who don't care about "Events" might simply use an aggregative or "Collection" concept to group a spatiotemporally contiguous set of "Occurrences", and get the same result. OR, one might view an Occurrence as a special type of Event-- it all depends on how these terms are defined using simpler constructs. As is, within Darwin Core, both of these terms have highly circular definitions that are problematic: An Occurrence is "The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.)". Note the circularity of "an occurrence is....an occurrence..." It will be tremendously beneficial to use RDF/OWL to clarify and define some of these key terms that in Darwin Core have circular definitions.

* discussion in sec. 3.3.1 "Linking duplicates"-- the meaning of "Duplicates" is not very well described-- are Duplicates simply replicates of evidence for a single Occurrence, and if so, is a single Occurrence only for a single Organism? It really depends on what is an Organism and what is a Sample or Token, or Collection or Event. In all cases, having some natural language definitions (that aren't self-referential), along with several broad coverage examples, will be greatly clarifying! Also, for describing duplicates, the "Equivalence Class" structure might be of value for circumscribing a "set of specimens" that might be duplicates. It would be nice, but not necessary, for authors to describe this possibility...
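
To make the "set of specimens" idea concrete, here is a minimal sketch under one possible reading of the suggestion: pairwise "is a duplicate of" statements are treated as a symmetric, transitive relation and grouped into sets with a union-find structure. The specimen identifiers and the function name are hypothetical, and this is plain Python rather than anything defined by DSW or OWL.

```python
def equivalence_classes(pairs):
    """Group items into sets, given pairwise 'is a duplicate of' statements."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    # Sort for a stable, readable result.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical duplicate statements from different providers.
duplicate_pairs = [
    ("sheet_A", "sheet_B"),
    ("sheet_B", "sheet_C"),
    ("sheet_X", "sheet_Y"),
]
duplicate_sets = equivalence_classes(duplicate_pairs)
```

Under these assumptions, sheet_A and sheet_C land in the same set even though no provider linked them directly, which is the practical value of circumscribing duplicates as a set.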

* if "derived_from" is to be transitive, the paper must better explain why, as transitivity is a powerful characteristic. In the PROV-O ontology, "was_derived_from" is NOT explicitly transitive, while in the more constrained use in IAO, "derives_from" IS transitive. It depends heavily on how the term is defined, and a natural language explanation can greatly help to clarify this, so that folks don't use the property without knowing its heavy entailments. If the "derived_from" property is functional, it would seem more intuitively transitive...
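
The weight of these entailments can be seen in a small sketch that computes what a reasoner would infer if a property were declared transitive. The resource names are hypothetical and the code is plain Python over (subject, object) pairs, not an OWL reasoner or the authors' implementation.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (subject, object) pair that holds if the property is transitive."""
    graph = defaultdict(set)
    for subject, obj in edges:
        graph[subject].add(obj)

    closure = set()
    for start in list(graph):
        # Depth-first walk over everything reachable from this subject.
        stack = list(graph[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(graph.get(node, ()))
    return closure


# Hypothetical asserted "derived_from" statements.
asserted = {("image", "specimen"), ("specimen", "organism")}
# Statements nobody asserted, but that transitivity entails.
entailed = transitive_closure(asserted) - asserted
```

Here the only new statement is that the image is derived from the organism, but on long derivation chains the number of entailed statements grows quickly, which is why the intended meaning of the property needs to be spelled out.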

The final paragraphs of discussion in sec. 3.3.2 onward are particularly useful in describing practical scenarios whereby DarwinSW provides distinct advantages over "flat vocabularies" due to processing via reasoners. Still, without strong definitions for some of the core class concepts, substantial semantic ambiguity will remain, and data interoperability will be challenging due to confusion among practitioners responsible for annotating their data with appropriate concepts.