A Benchmark Dataset for Industry 4.0 Production Line and Generation of Knowledge Graphs

Tracking #: 3198-4412

Muhammad Yahya
Aabid Ali
Qaiser Mehmood
Lan Yang
John Breslin
Muhammad Intizar Ali

Responsible editor: Guest Editors SW for Industrial Engineering 2022

Submission type: Dataset Description
Industry 4.0 (I4.0) is a new era in the industrial revolution that emphasizes machine connectivity, automation, and data analytics. The I4.0 pillars such as autonomous robots, cloud computing, horizontal and vertical system integration, and the industrial internet of things have increased the performance and efficiency of production lines in the manufacturing industry. Over the past years, efforts have been made to propose semantic models to represent manufacturing domain knowledge; one such model is the Reference Generalized Ontological Model (RGOM). However, its adaptability, like that of other models, is not ensured due to the lack of manufacturing data. In this paper, we aim to develop an I4.0 benchmark dataset that can be used to validate tools, techniques, and methods. This work is the result of collaboration with production line managers, supervisors, and engineers in the football industry to acquire realistic production line data. Knowledge Graphs (KGs) have emerged as a significant technology for storing the semantics of domain entities. They have been used in a variety of industries, including banking, the automobile industry, oil and gas, pharmaceutical and health care, publishing, media, etc. The data is mapped to RGOM classes and relations using an automated solution based on JenaAPI, producing an I4.0 KG. It contains more than 2.5 million axioms and about 1 million instances. This KG enables us to demonstrate the adaptability and usefulness of the RGOM. Our research helps production line staff to take timely decisions by exploiting the information embedded in the KG. In relation to this, the RGOM adaptability is demonstrated with the help of a use case scenario to discover required information such as the current temperature at a particular time, the status of the motor, the tools deployed on the machine, etc.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 07/Sep/2022
Minor Revision
Review Comment:

In this revised Dataset Description article, the authors evaluate their previous work on the Reference Generalized Ontological Model with a use case dataset on production line processes in the football manufacturing industry. The authors have generally addressed the issues raised in the reviews of the first version of this paper. Content-wise, the paper has improved significantly and the narrative is much clearer. The provided links are working.

Still, there are a few remarks that I think would benefit the paper if implemented, as listed below.

- The title is a bit unclear, as the benchmark dataset is oriented towards the generation of KGs focused on I4.0 production lines (rather than on I4.0 production lines and the generation of KGs). Quick suggestion (not necessarily the best one): “A Benchmark Dataset for the generation of Knowledge Graphs describing Industry 4.0 Production Lines”.

- It is not clear to me how the paper proposes an automated approach for mapping the data into RGOM to build a KG (see research contribution section). While I can read in Section 4.2.2 that “Amidst this, semantic extraction is based on the features of the relationship model that is utilized to extract the type of relationships and the relation between them”, I find it difficult to interpret this sentence. Also, the authors mention that the data can be mapped using R2RML, but this mapping algorithm is not part of the contribution of this paper. If the mapping algorithm included in the GitHub repository is generally applicable, this would need to be illustrated, including the boundaries for input files (e.g. what spreadsheet columns etc. are needed for the algorithm to automatically map unstructured data to RGOM?)

- Having a dataset that implements the FAIR principles is mentioned as a specific goal of the paper. I would suggest briefly revisiting, in the evaluation section, how the dataset conforms to each of the four letters of the acronym.

- A persistent URL referencing the RGOM ontology should be included, so others can implement it, e.g. using a footnote when first mentioning RGOM.

- The variables in Listing 1 and the columns in Fig. 6, which shows the results of the query, do not match.

- Listing 2 shows a query with 2 variables, with ?machine being a subject. However, in Fig 7 this yields a Literal instead of a URI. The query needs to reflect this. Also, why is a SPARQL filter used to map the variable to a URL instead of directly including the URL of the machine in the query? E.g.:

SELECT ?machine ?tool WHERE {
  d:Machine_2 a smo:ProcessingMachine ; # this triple can be left out
    smo:hasName ?machine ;
    smo:hasTool ?tool .
}

- The query in Listing 3 is syntactically incorrect, as the query variable Motor_Name has two question marks instead of one. Also, there should not be commas in between the variables, and the ?status variable is mentioned twice.

- Fig. 10 shows a different layout compared with the other figures that represent query results. A unified table layout would improve consistency. Also, using a text-based table instead of a figure generally allows the table contents to be found when searching the text, although this is not an issue in this paper.

- The running title is now a generic “Running head title”. You can set one using \runtitle{theRunningTitle} (LaTeX).

- P. 15 #23 “For example, this approach can be extended to other similar manufacturing processes such as volleyball and rugby ball production” => I suppose the approach is much more generic than producing sports balls, but applicable to any production line process. Perhaps mentioning a diverging production line example would better illustrate the general applicability of your approach?

- P. 15 #23 “The general process for manufacturing such products resembles.” => incomplete sentence

- Bosch, as a brand name, is sometimes not capitalized.

- Check the use of articles and singular/plural verb conjugation. Especially when talking about KGs, articles are sometimes missing (e.g. p3#44,#51 and p8#41) or the singular (KG) is used with a plural verb (e.g. p.2#28).
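
Regarding the remark on Listing 3 above, the three issues (a doubled question mark, commas between variables, and a repeated variable) can be caught mechanically before a query is included in the paper. The following is only a minimal sketch: the function name and the two query strings are hypothetical, Listing 3 itself is not reproduced here, and the check only covers simple SELECT clauses that list plain variables.

```python
import re


def lint_select_clause(query: str) -> list[str]:
    """Return the problems found in the SELECT clause of a simple SPARQL query."""
    match = re.search(r"SELECT\s+(.*?)\s+WHERE", query, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return ["no SELECT ... WHERE clause found"]
    clause = match.group(1)

    problems = []
    if "??" in clause:
        problems.append("variable with a doubled question mark")
    if "," in clause:
        problems.append("comma between projected variables")

    # Collect variable names, tolerating a doubled '?' so duplicates are still found.
    names = re.findall(r"\?+(\w+)", clause)
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"variable ?{name} projected more than once")
    return problems


# Hypothetical queries that only mimic the issues described in the remark.
BROKEN = "SELECT ??Motor_Name, ?status, ?status WHERE { ?m ?p ?o }"
FIXED = "SELECT ?Motor_Name ?status WHERE { ?m ?p ?o }"

print(lint_select_clause(BROKEN))
print(lint_select_clause(FIXED))
```

Running every listing through a SPARQL parser before submission would be more thorough; the sketch only shows that these particular issues are easy to detect.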

Review #2
Anonymous submitted on 12/Sep/2022
Review Comment:

The current revision is acceptable for publication.

Review #3
Anonymous submitted on 02/Nov/2022
Major Revision
Review Comment:

The authors have considered some of the previous reviews. However, several points remain unclear that are highly relevant to understanding and properly using the proposed KG, which does not facilitate its use. As a general comment, the article does not provide a detailed and explicit description of the KG. It mentions some of the ontological models reused, but it also mentions notions such as context and situation without making clear how the proposed approach uses them. In addition, there is no discussion of the contributions or advantages of using a KG or of having a semantically enriched representation of a dataset. Here are some more specific comments:

The two paragraphs between page 2 line 42 and page 3 line 7 are redundant. Please keep only one or combine them.

Page 3 lines 23-24: "However, their paper fails to present any use case allowing the ontology to be evaluated." This sentence is not clear. Why do they fail?

The related work section lists a series of papers, mentions only very briefly the relationships between the different works, and argues only that the datasets used to build the ontology models are not accessible. Were those ontology models built only from datasets, without using expert or domain knowledge? I think this section could be revised to make clearer the link between the works, as well as which parts of these ontology models have been reused in the KG proposed by the authors.

Regarding the explanation given on page 6 lines 6-34, it is not clear if the data of type "variable attributes" are real or generated. In the first paragraph the authors state that they recorded data such as power consumption, temperature, pressure, etc. implying that real sensors were used but in the next paragraph they state that the values for the "variable attributes" are created using a uniform probability distribution in each sub-process. This should be clear since the authors argue in the state of the art section that the lack of real data is a shortcoming.
I do not understand why in the example described on page 6 lines 25-30, it is said that f(x)=2. Where does this result come from? Even if formula (1) was used I am not sure that the example is correct.
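
To make this question concrete, here is a minimal sketch that assumes formula (1) is the standard continuous uniform density f(x) = 1/(b - a) on [a, b]; the formula is not reproduced in this review, and the interval bounds below are purely illustrative rather than taken from the paper. Under that assumption, f(x) = 2 can only result from an interval of width 0.5.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of the continuous uniform distribution on [a, b] evaluated at x."""
    if b <= a:
        raise ValueError("b must be greater than a")
    return 1.0 / (b - a) if a <= x <= b else 0.0


# Illustrative intervals only (assumptions, not values from the paper).
print(uniform_pdf(0.25, 0.0, 0.5))    # interval of width 0.5
print(uniform_pdf(30.0, 20.0, 40.0))  # e.g. a temperature range of width 20
print(uniform_pdf(50.0, 20.0, 40.0))  # outside the interval
```

If the sub-process range in the example on page 6 is wider than 0.5, the stated value f(x) = 2 cannot follow from this density, which is why stating the interval bounds in the example would help.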

Page 8, first paragraph. It would be useful to detail why it is necessary to have a semantically enriched representation of the data since this is not described in the article. For example, explain which are the advantages that would be provided by having this semantically enriched representation of the data, such as to make a better exploitation of the data, etc.

In Section 4, what is the purpose of layer 3? I see this layer more as the output of layer 2, and then that KG can be exploited in the application layer.

Page 10 line 28. The term Activity appears in this sentence; what do the authors consider to be an activity? Moreover, this term does not appear in Figure 4. Perhaps the authors mean Process. This leads to some doubts when the authors talk about context, since they neither explain concretely what context is nor provide a definition of it. I believe that the authors are referring to the notion of context defined by Dey et al., "Towards a better understanding of context and context-awareness" (2000).

Similarly, the last sentence of line 30, "The use of core ontologies with domain presents useful information regarding a situation." What is a situation? Perhaps giving a reference would clarify.

The first sentence of section 7 "This research presents the I4.0 dataset which can be used to validate the tools, techniques, and methods required for I4.0 applications." is very strong and the article does not demonstrate how this can truly be used to validate tools, techniques, and methods required for I4.0 applications. Please rephrase.

As a general remark, it would be good if the authors would write the names of the object properties and ontology concepts in italics for example.

There are some answers provided in the letter that do not appear exactly in the article. For example, in the responses some references are given that are not in the article. I am not sure if this is a mistake but maybe the authors forgot to add them.

Please verify that your code has a license so that it can be reused.

Page 2 line 16 -> What are information models?

Page 2 line 42 -> "... i.e. ..."

Page 4 line 1 -> "In our previous work recently, we proposed ..."

Table on page 5, goes over the limits of the margins.

Page 9 line 26. I think it should be Figure 4 and not Figure 3.

Page 10 line 35. "Amidst this, semantic extraction is ..."