A Benchmark Dataset for Industry 4.0 and Knowledge Graphs

Tracking #: 3029-4243

Muhammad Yahya
Aabid Ali
Qaiser Mehmood
Lan Yang
John Breslin
Muhammad Intizar Ali

Responsible editor: Guest Editors SW for Industrial Engineering 2022

Submission type: Dataset Description
Industry 4.0 (I4.0) is a new era in the Industrial Revolution that emphasizes machine connectivity, automation, and data analytics. The I4.0 pillars, such as autonomous robots, cloud computing, horizontal and vertical system integration, and the industrial internet of things, have increased the performance and efficiency of production lines in the manufacturing industry. Over the past years, efforts have been made to propose semantic models to represent manufacturing domain knowledge; one such model is the Reference Generalized Ontological Model (RGOM). However, its adaptability, like that of other models, was not ensured due to the lack of manufacturing data. In this paper, we aim to develop an I4.0 benchmark dataset that can be used to validate tools, techniques, and methods. This work is a result of collaborations with the production line managers, supervisors, and engineers of the football industry to acquire realistic production line data. Knowledge Graphs (KGs) have emerged as a significant technology to store the semantics of domain entities. They have been used in a variety of industries, including banking, the automobile industry, oil and gas, pharmaceutical and health care, publishing, media, etc. The data is mapped to RGOM classes and relations using an automated solution based on JenaAPI, producing an Industry 4.0 Knowledge Graph. It contains more than 2.5 million axioms and about 1 million instances. This KG enables us to validate the adaptability and usefulness of the RGOM. Our research helps the stakeholders to take timely decisions by exploiting the information embedded in the KG. In relation to this, the RGOM adaptability is validated with the help of a use case scenario to discover required information such as the current temperature at a certain time, the status of the motor, the tools deployed on the machine, etc.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 17/Mar/2022
Minor Revision
Review Comment:

The contribution of this paper is potentially significant. The lack of an open dataset for Industry 4.0 research is well known, and this article provides ontology and KG datasets. The paper is acceptable in general. One suggestion: it would be clearer if the authors explained how this football production case is actually an Industry 4.0 case, instead of only listing the sequence of the production process. I suggest using Figure 1 to address this. Multiple typos and inconsistencies (e.g., in capitalization) are also noticeable. I suggest reading through and updating the manuscript thoroughly.

Review #2
Anonymous submitted on 18/May/2022
Major Revision
Review Comment:

In this “Dataset Description” article, the authors evaluate their previous work on the Reference Generalized Ontological Model with a use case dataset on production line processes in the football manufacturing industry.

In the introduction, the authors identify a lack of openly available knowledge graphs to be used in use cases for Industry 4.0. However, it should be clear that they are specifically targeting open datasets for production line processes – as Industry 4.0 is a very broad term, and for other industries than production lines there might exist datasets which are neither toy examples nor closed-access.

The authors mention that this benchmark dataset allows the adaptability of their RGOM data patterns to be verified. Although this article is a dataset description that focuses on the creation and availability of a particular KG, I think it is nevertheless beneficial to situate this in the larger picture: as the authors mention, one of the goals is to demonstrate/verify the RGOM framework’s adaptability. I think this is difficult with only a single dataset – in my opinion, flexibility can only be demonstrated by applying multiple datasets which contain significant differences. Also, it would help the reader to understand the value of the paper if it were made more explicit how to re-use the dataset in future use cases. In the conclusion, it is mentioned that RGOM and the underlying dataset will help the research community to validate their tools and techniques. However, it seems that the dataset will be more useful as an example for industry on how to set up a Linked Data-based production line, rather than for research purposes? As indicated in the submission type guidelines, it is recommended to illustrate adoption of the dataset by third parties.

The description of the dataset itself (parameters as mentioned in the guidelines for submission types, “Dataset Description”) is insufficient. Please re-check the journal requirements for a “Dataset Description” paper. E.g., the URL in footnote 2 leads to the GitHub repository that contains the code to generate the dataset. Although a URL to an (inaccessible) Google Drive file is given in the repository Readme, a dereferenceable URL to find and/or query the dataset itself is not given directly in the paper.
The RGOM framework itself is introduced in Section 4.2. However, I think this should happen earlier in the text, as it seems to be a core concept of the paper, and a main reason to write this contribution. Maybe it can be introduced briefly in Section 2: Related Work. Doing this will help the reader situate the work in the sections to come.

In Fig. 2, confusion may arise as there are some graph nodes which seem to be Literals – probably because of the use of quotation marks (e.g., “cutting TPU roll process number 51”). It appears the authors mean to identify IRI concepts in a more ‘human-readable’ way, but they should make sure to avoid this confusion – RDF Literal values cannot have outgoing edges.
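
To make the constraint concrete, the following is a minimal sketch in plain Python (no RDF library) of the difference between the two readings of Fig. 2. The namespace `http://example.org/i40#`, the property `executedOn`, and the resource names are hypothetical and are not taken from RGOM; only `rdfs:label` is a standard term. It shows a check that flags triples with a literal in subject position, and the usual remedy of using an IRI for the node and attaching the human-readable text as a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


RDFS_LABEL = IRI("http://www.w3.org/2000/01/rdf-schema#label")
EX = "http://example.org/i40#"  # hypothetical namespace, not from RGOM


def literal_subjects(triples):
    """Return the triples whose subject is a literal (not allowed in RDF)."""
    return [t for t in triples if isinstance(t[0], Literal)]


# Hypothetical resources used only for illustration.
process = IRI(EX + "CuttingTPURollProcess51")
machine = IRI(EX + "CuttingMachine")
executed_on = IRI(EX + "executedOn")

# Reading 1: the quoted node is a literal with an outgoing edge (invalid).
ambiguous = [
    (Literal("cutting TPU roll process number 51"), executed_on, machine),
]

# Reading 2: the node is an IRI; the readable text is attached as a label.
unambiguous = [
    (process, RDFS_LABEL, Literal("cutting TPU roll process number 51")),
    (process, executed_on, machine),
]

invalid_in_ambiguous = literal_subjects(ambiguous)
invalid_in_unambiguous = literal_subjects(unambiguous)
```

In the figure itself, the same effect could be achieved by drawing such nodes without quotation marks, or by showing the label as a separate literal node.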

When introducing the ontological structure of RGOM (Section 4.2.1), it is not clear in Fig. 3 whether the authors define new concepts or reuse existing RDF vocabularies. This becomes clear in the SPARQL examples of Section 5, but should already be mentioned when introducing the ontologies.

Some last notes:
- A reference worth mentioning for mapping the non-RDF dataset to RDF (Section 4.2.2) is R2RML (or RML).
- In section 5: is it relevant to mention the used computer system when no performance benchmark is included in the paper?
- Also, please check that all abbreviations are introduced in full on their first occurrence in the text (e.g., CPS, SIB), and check Table 1 for consistent usage of punctuation marks in table cells.
- Finally, I strongly suggest performing a thorough spelling and grammar check and correcting the existing typos in the text.

Review #3
Anonymous submitted on 10/Jul/2022
Major Revision
Review Comment:

The article proposes an approach to build an I4.0 benchmark dataset by using KG, called I4.0 KG, to integrate data from sensors attached to machines in a production line for the manufacture of soccer balls.

As a general comment, the article lacks a concise and clear description of the key characteristics of the dataset. This does not facilitate its usage for different purposes. No information about dataset maintenance, reported usage, or known shortcomings and limitations is provided. Furthermore, the article does not demonstrate the advantage or contribution of using a KG in addition to the ontological models that exist in the literature for integrating heterogeneous data and accessing this data. I think the authors should emphasize this to make it clear to the readers. Also, the authors talk about Industry 4.0 in general, but the KG they present is specific to soccer ball manufacturing. I strongly recommend that the authors dedicate a section to discussing the possibilities of adapting their approach and the proposed KG to other manufacturing activities.

Here are some more specific comments:

The title of the article is not clear. I don't understand why KG appears at the end.

In the abstract:
"Our research helps the stakeholders to take timely decisions by exploiting the information embedded in the KG." - I do not see this verified in the article.

In the introduction:
Mass production was achieved in previous industrial revolutions, not in I4.0.
I4.0 is more about the use of various technologies and also the use of artificial intelligence to make better use of resources to optimize production.

The paragraph between lines 12 and 24 on page 2 is not clear. It might be better to give the necessary definitions first and then describe the differences and how they complement each other or how they are linked. Ontology and KG are not defined; please add the definitions here with the corresponding references.

The authors enumerate the contributions of the article; however, I do not think they are all contributions. For example, the last one is the validation of RGOM, a model that is not sufficiently described in the article. I do not think the validation of this model is a contribution of this article. Rather, the aim should be to show that the KG built from RGOM is in fact a contribution.

In the related work:
"... Internet of Things, Internet of Services, Cyber-Physical Systems, Digital twins, ..." are not technologies. Please rephrase.

The state of the art is difficult to understand. I think the authors should work on this section and make clear the link between the existing works and what their limitations are so that it is clear how the proposed model addresses those limitations and to what extent.

In section 3:
"... acquisition and generation of the dataset." This phrase is not clear to me. Maybe "data acquisition and dataset construction"?

There is no transition between the first paragraphs of this section and subsections 3.1 ... 3.9. At the beginning, two types of attributes, static and variable, are mentioned, and then the text goes on to describe each machine without any transition. Adding a transition here would make the text more readable and aid understanding.

Regarding the random creation of the variable attribute values, the authors give a reference [17] that explains how these values are generated; however, I think it would be worth giving more details about this in the article to show how accurate these values are. Also, it would be interesting to know whether the creation of these values takes into account the relationships that exist between certain properties (for example, the temperatures of the different components of a machine). Is it possible to represent these relationships (perhaps physical ones) in the proposed KG? I think this is a key point and would allow the data of the model to be further enriched.
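
As an illustration of what taking such a relationship into account could look like, here is a minimal Python sketch in which a bearing temperature is derived from a motor temperature rather than drawn independently. This is not the generation procedure of reference [17]; the linear model, the coefficients, the value ranges, and the function names are all invented for the example.

```python
import random


def simulate_temperatures(n, seed=0):
    """Generate n (motor, bearing) temperature pairs with a built-in dependency.

    All coefficients are invented for illustration only.
    """
    rng = random.Random(seed)
    readings = []
    for _ in range(n):
        load = rng.uniform(0.2, 1.0)  # hypothetical fraction of rated load
        # Motor temperature rises with load, plus measurement noise.
        motor = 25.0 + 40.0 * load + rng.gauss(0.0, 1.0)
        # Bearing temperature follows the motor temperature, plus its own noise.
        bearing = 0.8 * motor + rng.gauss(0.0, 0.5)
        readings.append((motor, bearing))
    return readings


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


readings = simulate_temperatures(1000)
motor_values = [m for m, _ in readings]
bearing_values = [b for _, b in readings]
correlation = pearson(motor_values, bearing_values)
```

Values drawn independently for each property would show a correlation near zero, so a check of this kind on the published dataset would indicate whether such dependencies are present.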

In section 4:
"IT silos" is a very specific term, maybe give the definition or use another term like data storage.

"... usability of this data for, e.g., subsequent analyses and reasoning." - I do not understand this phrase.

Linked Open Data was never mentioned in the article before. Maybe briefly explain what it is about.

"This goal can be ..." - Which goal?

"The following describes the steps ..." -> Maybe the layers and the interaction of the different components instead of "the steps".

"... at a certain timestamps ..." -> at different timestamps.

"... unconnected data" - What does this mean? I think it is a very general statement; it would be better for the authors to be more specific and make clear what they mean by "unconnected data".

"... interaction of production staff with unconnected data is very difficult." - I do not understand this phrase. Does it refer to access to information, interpretation, ...?

The authors state that the RGOM model is inspired by the standards adopted by RAMI4.0. RGOM is also based on the model proposed in "Giustozzi, F.; Saunier, J.; Zanni-Merk, C. Context modeling for industry 4.0: An ontology-based proposal. Procedia Comput. Sci. 2018, 126, 675–684" and reuses other ontologies such as the Time ontology and SSN, among others. The reference [7] cites these models. The construction of the KG is not well described: it is difficult to see the link between the KG and these ontological models, and how the KG is constructed from these models and the data. This is linked to my general comment that it is not easy to see what the KG contributes beyond the use of these ontological models.
I think this whole section should be rewritten and restructured to make it clear how the KG is constructed and why it is useful and necessary.

In section 5:
It is difficult to see the adaptability of RGOM through this example. The use case is very simple and does not demonstrate the usefulness of the KG. I think that the queries are too simple; consider adding more complex queries that show the real usefulness of the KG and the advantages it offers in terms of integrating heterogeneous data with different temporal resolutions.

In the query of Listing 3, it would be helpful to give more details about the status of the engine of a machine. I do not see the concept Status in the model; it should be described how this status is obtained or what it represents (for example, whether it can be seen as abnormal behavior). This would show that the KG offers this kind of semantic information, which an operator could exploit and even use to obtain more information associated with this abnormal behavior to determine its causes.
Furthermore, consider adding third-party uses to provide evidence of the usefulness of the KG dataset, for example a possible application (rather than only mentioning methods and tools) that makes use of the KG to demonstrate its usefulness. One such example would be how the KG could help to create suitable datasets for building machine learning models for predictive maintenance.

Can this KG be adapted to a case other than the manufacture of soccer balls? I think that the authors could include a discussion about this and give some hints on how to do it, even if they do not fully validate this adaptability.