A Semantic Meta-Model for Data Integration and Exploitation in Precision Agriculture and Livestock Farming

Tracking #: 2920-4134

Dimitris Zeginis
Evangelos Kalampokis
Raúl Palma
Rob Atkinson
Konstantinos A. Tarabanis

Responsible editor: 
Guest Editors Global Food System 2021

Submission type: 
Full Paper
In the domains of agriculture and livestock farming, a huge amount of data is produced by numerous heterogeneous sources, including sensor data, weather/climate data, statistical and government data, drone/satellite imagery, video, and maps. This plethora of data can be used in precision agriculture and precision livestock farming to provide predictive insights into farming operations, drive real-time operational decisions, and redesign business processes. The predictive power of the data can be further boosted if data from diverse sources are integrated and processed together, thus providing further unexplored insights. However, the exploitation and integration of agricultural data is not straightforward since the data: i) cannot be easily discovered across the numerous heterogeneous sources and ii) use different structural and naming conventions, hindering their interoperability. The aim of this paper is first to study the characteristics of agricultural data and the user requirements related to data modeling and processing from nine real cases in the agriculture, livestock farming and aquaculture domains, and then to propose a semantic meta-model based on W3C standards (DCAT, PROV-O and the QB vocabulary) that enables the definition of metadata facilitating the discovery, exploration, integration and access of data in the domain.

Major Revision

Solicited Reviews:
Review #1
By Christopher Brewster submitted on 10/Nov/2021
Minor Revision
Review Comment:

## Overall:

The paper presents a "semantic meta-model" for the precision agriculture and livestock farming domains. The main motivation the authors provide for this model is the need for data integration from multiple heterogeneous sources. They particularly focus on spatial, temporal and data access metadata. The paper takes for granted that such a meta-model is needed (i.e. is absent from the literature or unavailable), which might be true but, given the plethora of existing vocabularies/standards/ontologies in the semantic web and the agrifood sectors, needs to be justified a little more carefully.

Fundamentally, the meta-model provides a way to annotate data sets with metadata which can then be queried to determine if the underlying data set has relevant data. As such the approach does not involve mapping or translation of the underlying data to a common format (which may or may not be a strategically sensible approach). It is a peculiarity of this paper that there is repeated reference to the formulation of SQL queries, and yet all the worked examples in Section 6 are using SPARQL.
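
To make concrete what "querying the metadata to determine relevance" means, here is a minimal sketch in plain Python rather than SPARQL. The `DatasetMetadata` class, the theme strings and the two catalogue entries are hypothetical stand-ins and are not taken from the paper; the point is only that relevance is decided from the description (theme and temporal coverage) without touching or translating the underlying data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical, simplified stand-in for a dataset description."""
    title: str
    themes: frozenset  # illustrative theme labels, not real AGROVOC URIs
    start: date        # temporal coverage, inclusive
    end: date          # temporal coverage, inclusive


def is_relevant(meta, theme, start, end):
    """True if the dataset carries `theme` and its coverage overlaps [start, end]."""
    overlaps = meta.start <= end and start <= meta.end
    return theme in meta.themes and overlaps


# Two invented catalogue entries; only metadata is stored here.
catalog = [
    DatasetMetadata("Field sensor readings", frozenset({"soil moisture"}),
                    date(2016, 2, 1), date(2016, 2, 29)),
    DatasetMetadata("Regional weather", frozenset({"precipitation"}),
                    date(2015, 1, 1), date(2015, 12, 31)),
]

# Discovery step: which datasets might hold soil moisture data for this period?
hits = [m.title for m in catalog
        if is_relevant(m, "soil moisture", date(2016, 2, 10), date(2016, 3, 1))]
print(hits)
```

Under these assumptions the underlying data stay in their native store and format, which is exactly why the later comments ask how the actual data are then retrieved and integrated.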

Overall the paper presents an interesting and relevant approach to the use and integration of heterogeneous data sets. There are a number of conceptual inconsistencies, or perhaps design decisions, which would benefit from greater clarity of explanation and thus improve the paper overall and its utility to future readers and developers (see some issues below).
Originality: Somewhat original in the extension of the DCAT model and the deeper description of the form of underlying data sets. Lots of conceptual issues remain, including whether the approach could obtain wider adoption and what kind of web architecture would support more widespread use of this approach. The review of existing literature is somewhat limited, as there is a lot of work building on DCAT which is not mentioned.
Significance: For the agrifood sector, this paper may have some impact. However, the narrow focus on time and location means there are many other aspects of data integration which remain unaddressed.
Quality of writing: Generally excellent, well structured, almost no grammatical mistakes.
Resource: The data model is available on the OGC website and provides a stable URL. There is significant documentation here to support subsequent use. The model and its availability appear to follow the FAIR data principles.

## Detailed comments:

* Sect 1
- It might be useful to explain what the authors intend by a "semantic meta-model" as opposed to a simple ontology or other classificatory structure.
* Sect 2
- Sec 2.1 is unnecessary given the audience of this journal
- Sec 2.2 and 2.3 are catalogues of relevant vocabularies but do not provide us with any critical insight. It is usual in the "background and related work" section of a paper to make clear not just *what* another piece of work does but also what *weaknesses* there are.
- Sec 2.3 - the selection of agrifood vocabularies/ontologies mentioned seems arbitrary, and given the large number of existing vocabularies perhaps some criteria for the selection of this set of items could be offered - why are these mentioned and not others?
* Sect 3
- "functional and non-functional requirements" -- this does not make sense when speaking of an ontology or data model. Reading section 4.3 one can see what you mean by this but this is a very non-standard way of interpreting or defining "functional and non-functional requirements". See comment below.
* Sect 4
- 4.3 "requirements for the proposed semantic model in terms of functional requirements (i.e. activities that can be facilitated by the use of the model) and non-functional requirements (i.e. aspects/concepts that need to be covered by the model)" -- here I would expect the use of the term "competency questions". Later use of the term "competency aspect" is not really the same thing.
* Sect 5
- The main categories of metadata described seem to be largely concerned with providing metadata about a dataset, rather than providing a generally applicable data model i.e. to enable querying and interoperability (????)
- "However, the qb:dimension and qb:measure properties have the qb:Component Specification as domain the and cannot be used directly at the dcat:Dataset." -- grammar!
- AGROVOC is chosen as the vocabulary to define the dcat:theme of datasets. Why? No justification is provided for this choice and this is an important design choice. (Obviously other uses could make different choices while using the same meta-model). One reas
- "datasets could declare conformance to the proposed model" -- this is not clear to me because the model basically allows one to describe a data set, not to ensure that the data model used in the actual data follows this or any other model. e.g. weather data could be in multiple formats/models and its metadata could be described with this model nonetheless. There is limited mention of mapping in this paper, but it seems to mostly concern time and location.
* Sect 6
- "In order to enable the generation and execution of SQL queries based on the metadata, the label (rdfs:label) of the dimension/measure should be the same as the corresponding field at the database table where the dataset is stored." --- so if I have understood correctly all the metadata description of the data sets is there to enable an SQL query to be issued. This seems a very non-semantic approach so an explanation of this approach would be in order here.
- "In this dataset the temporal coverage (February 2016) is defined using the reference.data.gov.uk Time Interval vocabulary (the example uses the prefix uk-month)." --- but in the previous example xsd:dateTime is used, so how does the meta-model actually unify time (mentioned as a key dimension for integrating data sets)? In the next paragraph it says "Note that the time dimension URI is the same as the one used at listing 5 in order to facilitate the interoperability between datasets." so this further confuses the reader. Please make clearer what is going on here.
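
The label rule quoted in the first Sect 6 comment can be illustrated with a minimal sketch, assuming the metadata have already been read into a plain Python dictionary. The table name, the label values and the `build_select` helper are hypothetical and do not come from the paper; SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical metadata: labels of the dataset's dimensions and measures.
# Under the quoted rule, each label must equal a column name in the table.
metadata = {
    "table": "sensor_readings",          # assumed table name
    "dimensions": ["time", "location"],  # assumed rdfs:label values
    "measures": ["soil_moisture"],       # assumed rdfs:label value
}


def quote_ident(name):
    """Quote an SQL identifier so that a label cannot break the statement."""
    return '"' + name.replace('"', '""') + '"'


def build_select(meta):
    """Build a SELECT over all dimension and measure columns named in the metadata."""
    columns = ", ".join(quote_ident(c) for c in meta["dimensions"] + meta["measures"])
    return f"SELECT {columns} FROM {quote_ident(meta['table'])}"


# In-memory table standing in for the store where the dataset lives.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE sensor_readings ("time" TEXT, "location" TEXT, "soil_moisture" REAL)'
)
conn.execute(
    "INSERT INTO sensor_readings VALUES ('2016-02-01T10:00:00', 'field-1', 0.31)"
)

query = build_select(metadata)
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

The sketch works only because the labels and the column names coincide. A label that drifts from the column name would no longer address the intended column, so the semantic description ends up coupled to one physical schema, which is the concern raised above.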

Review #2
Anonymous submitted on 15/Nov/2021
Major Revision
Review Comment:

This paper introduces a meta model to semantically enrich and integrate precision agriculture and livestock farming data. The model reuses several widely adopted standards, including DCAT, QB, and PROV-O. The main contribution is the creative linking between DCAT and QB through a SHACL shape so that data measures are associated with datasets. In addition, the paper is well structured in general and is relatively easy to follow. Below are my major comments:

(1). The meta model is proposed for the domain of precision agriculture and livestock farming. However, even though there is some discussion of the specifications of agricultural metadata requirements, I do not see how these ‘specifications’ differ from those of other types of domain-specific data or how they are implemented in the model. The discussion in section 4 seems very general. Consequently, I do not see much novelty in the proposed meta model for precision agriculture and livestock farming.

(2). Following my first comment, I am skeptical about how the authors categorize agriculture and livestock farming data. First, I think the categorization generally works for all types of data, e.g., environmental data, urban data, etc. So how different would this proposed meta model be for non-agriculture data? Second, won’t sensor data overlap with earth observations? Won’t maps and earth observations all include location data? So is the proposed categorization not mutually exclusive? Next, how does this categorization help the design of the meta model? Will it also be a class in the model (I do not see it in Figure 2 though)? How would such a categorization help end users to answer their competency questions? Finally, I think maps (if you mean vector-based polygons, polylines, or points) are structured data, as they are stored in relational databases in most GIS systems. Also, I am curious why maps and images cannot have a structure (see the statement of the second bullet point in Section 5.1). For example, a geographic entity can have a spatial relation with another, which should be captured by a schema (data structure).

(3). In section 4.3, the authors summarize three functional requirements based on interviews and survey with stakeholders. What kinds of questions were asked in the interviews or survey? How many stakeholders were interviewed? Without these elaborations, the summary seems arbitrary. Additionally, as in my comment (1), I do not see how different the non-functional requirements (Table 1) would be for a non-farm/non-agriculture domain.

(4). In Section 6: Application of the Model, I suggest replacing the listings with figures (for these data and structures in RDF), which would be more readable and would save a lot of space as well. More importantly, I do not see much significance in these demonstration examples. For example, I believe using SOSA together with DCAT might offer similar capability, if not better, as temporal and spatial information is already modeled in SOSA. So a comprehensive comparison between different ways of designing the model would be needed here to show the significance of the work. Alternatively, the authors should show the capability of this model to address rather complex competency questions, e.g., semantically integrating data from various sources. The current queries shown are trivial IMHO (i.e., can be done using other models).

(5). It would also be worthwhile for the authors to explain whether it is a better idea to semantically annotate individual data records using RDF, rather than only at the meta level. One advantage, I think, is that one does not need to know both SPARQL and SQL at the same time in order to query useful data. It might be beneficial to have either a relational database or linked data in a project, but not both? All these questions are fundamental to this work and worth discussing.

(6). The model is served on a long-term maintained URL. However, the README on the provided GitHub page is missing, so replicating the model/data might be difficult.

(7). The paper must be proofread carefully. There are many typos. E.g., Page 2 paragraph 1, “… as well as the identification the best harvesting period” --> “identification of the best”. Page 3 right side paragraph 1, “… as the definition of the proposed mode in this is paper” --> delete 'is'.

In summary, the topic discussed in this paper is trending and the proposed SHACL shape to address the linking between two ontologies is creative. But for a journal paper, this paper should be substantially improved in terms of its methodological originality and result significance.

Review #3
Anonymous submitted on 22/Nov/2021
Major Revision
Review Comment:

This paper proposes a meta model that can integrate various forms of heterogeneous data. The motivation for this model is the authors’ claim that data aggregation can improve the predictive power of models learnt from that data.

The paper was well written and easy to read.

The motivation seems to have sound foundations, as data aggregation in competitions such as the Netflix prize greatly increased the F-Measure of models. The literature review seems a little brief and covers the usual suspects. I would have expected agricultural machinery to be among the data sources that you would want to integrate, and therefore to have seen ADAPT in the literature review. Most of the resources mentioned are produced by non-profit organizations or NGOs and therefore exclude the big industry players, so we can’t be sure that these resources are actually used by the agricultural organizations that this product is targeting.

The main issue I have with this paper is the exclusion of industry, or at least the absence of any mention of industry involvement. The lack of industry involvement in the development of standards or models will condemn the standard to obsolescence. In the development of the case studies I would expect more than just “domain experts”, which implies the usual suspects from NGOs and academia. I would be grateful if you could provide a more detailed description of the people that you consulted. I appreciate that it was applied to an EU project; however, the project does not contain any of the large industry actors such as Syngenta. Therefore I feel that this is a top-down approach which imposes a structure on potential end users rather than a bottom-up approach which creates a model that the majority of industry will use.

The user requirements are a little light. I would have liked to have known how participants were selected, their profile and how conflicting requirements were dealt with. Without knowing the profile of the users, it is difficult to know if these use cases are relevant or just edge cases.

The rest of the paper is an in-depth description of the model, and I have no real complaints about it. My main concern is that once the project is over, this model will simply not be used. If you could give a more robust explanation of the user requirements and of who participated, I think that this paper could be a good candidate for acceptance into the journal.