Deriving Semantic Validation Rules from Industrial Standards: an OPC UA Study

Tracking #: 3039-4253

Yashoda Saisree Bareedu
Thomas Frühwirth
Christoph Niedermeier
Marta Sabou
Gernot Steindl
Aparna Saisree Thuluva
Stefani Tsaneva
Nilay Tufek Ozkaya

Responsible editor: 
Guest Editors SW for Industrial Engineering 2022

Submission type: 
Full Paper
Industrial standards provide guidelines for data modeling to ensure interoperability between stakeholders of an industry branch (e.g., robotics). Most frequently, such guidelines are provided in an unstructured format (e.g., PDF documents), which hampers the automated validation of information objects (e.g., data models) that rely on such standards in terms of their compliance with the prescribed guidelines. This increases the risk of costly interoperability errors induced by the incorrect use of the standards. There is therefore an increased interest in automatic semantic validation of information objects based on industrial standards. In this paper we focus on an approach to semantic validation that formally represents the modeling constraints from unstructured documents as explicit rules (to be then used for semantic validation) and (semi-)automatically extracts such rules from PDF documents. We exemplify an adaptation of this approach in the context of the OPC UA industrial standard and conclude that (i) it is feasible to represent modeling constraints from the standard specifications as rules, which can be organized in a taxonomy and represented using Semantic Web technologies such as OWL and SPARQL; (ii) we could automatically identify constraints in the specification documents by inspecting the tables (P=87%) and text of these documents (F1 up to 94%); (iii) the translation of the modeling constraints into rules could be fully automated when constraints were extracted from tables, and required a Human-in-the-loop approach for constraints extracted from text.
Major Revision

Solicited Reviews:
Review #1
By Jürgen Bock submitted on 01/Apr/2022
Minor Revision
Review Comment:

The submitted article addresses an important challenge and missing link that hampers the realization of CPPS, namely the fact that machines are expected to communicate and cooperate autonomously, but industrial standards for protocols, interfaces and data models are almost solely provided as PDF documents in natural language. Hence it requires human engineers to interpret these standards and configure their machines in order to achieve interoperability. The authors present an approach to (semi-)automatically extract constraints from such standards and derive rules to validate concrete data models for compliance with these standards. The authors develop their approach for the OPC Unified Architecture protocol and information model, particularly for OPC UA Companion Specifications as meta-models for specific domains and purposes. The contribution is to achieve a high degree of automation when it comes to validating whether a specific device description complies with a particular OPC UA Companion Specification, especially regarding the constraints formulated in tables and natural language text of the specification.

While the paper provides a valuable contribution, a concern might be the limited scope that comes with the focus on OPC UA. The impact of the research might thus be strongly dependent on OPC UA establishing itself as a quasi-standard for industrial communication. Currently the number of Companion Specifications is modest and so is the impact of this article. In that situation one might even argue that it would be a pragmatic approach to include a formal representation of constraints in terms of rules directly in the course of the standardization process, such that these rules are provided directly by the experts themselves. All the more so, since the proposed approach already includes a Human-in-the-loop supported rule extraction method. Regarding the impact, the article thus would greatly benefit from a discussion about how the provided approach could (in principle) be applied to other industrial standards, and which of the stages and methods would have to be altered. Right now, many of the techniques applied seem to be quite tailored towards OPC UA Companion Specifications and the translation of the OPC UA NodeSet model to OWL.

Speaking of validation of semantic models, an immediate first thought is SHACL. Apart from a brief mention in the related work section, the authors do not consider SHACL as an option for constraint validation, but prefer to use SPARQL queries for validation. A brief discussion of that decision and a comparison with a SHACL-based approach would strengthen the article. (A reference to some related literature is provided in the bibliography ([26]) but it is never cited in the text. There are 9 out of 32 bibliographic references never referred to from the text!)
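To illustrate the comparison being asked for: a simple cardinality constraint such as "every object type node must carry a browse name" can be expressed either as a SPARQL query that reports violations, or declaratively as a SHACL shape. This is a minimal hypothetical sketch; the prefix and property names are illustrative only and not taken from the paper's NodeSet-to-OWL mapping.

```sparql
# Hypothetical validation query: select nodes of a given type
# that lack a browse name (all identifiers are illustrative).
PREFIX ex: <http://example.org/opcua#>
SELECT ?node WHERE {
  ?node a ex:ObjectType .
  FILTER NOT EXISTS { ?node ex:browseName ?name }
}
```

```turtle
# The same constraint as a SHACL shape (same illustrative vocabulary).
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/opcua#> .

ex:ObjectTypeShape a sh:NodeShape ;
  sh:targetClass ex:ObjectType ;
  sh:property [
    sh:path ex:browseName ;
    sh:minCount 1 ;
  ] .
```

The SPARQL variant yields a result set of offending nodes, whereas a SHACL engine produces a standardized validation report; discussing this trade-off is presumably what the comparison would cover.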

The techniques used in the approach are sometimes not well justified. In terms of good research practice, it would strengthen the article if there were explanations why certain techniques are applied. This could help to avoid conclusions such as the one in Sect. 7.4.1 that says "... that, Neural Networks might have the problem of overfitting while Linear SVM and Random Forest can be generalized to new data more easily." This interpretation falls short in that we do not know anything about the hyperparameters of the Neural Network and the other techniques, which leaves the impression that this machine learning based approach is based on default or "best-guess" configurations.

The used mapping of NodeSet models to OWL should be mentioned earlier in the document, since it is essential to understand why SPARQL can be used to query NodeSets. Apart from that the article is very well structured and presented in clear and almost flawless English language. The reader can follow the authors' thoughts without too much effort, however, some prior knowledge in the area of industry standards is required, esp. in the beginning of the article.

Some minor issues regarding structure and language are the following:

Page 4, right column, line 37: "thee"

Figure 4 is only referred to from the text in Section 6. There should be a reference from Sect. 3 as well.

Page 10, right column, line 43: a -> an

Section 6 should be named "Stage 3 and 4: ..."

Section 6.1: Typo: "OfflineChechableConstraints" (three times)

Section 7.3: before Eq. (1): "not considered _for_ the purpose of this task"

Section 7.4.2: Eq. (1) and Eq. (2) with references to be put in ()

Page 20, left column, line 33: the statement "n-ary relations (e.g. triples)" suggests that a triple can represent an n-ary relation. However, a triple (in the sense of RDF triple) can only represent a binary relation via its predicate. Several triples with an auxiliary node (e.g. a blank node) are required to represent an n-ary relation.
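The reviewer's point can be illustrated in Turtle (a minimal sketch with made-up identifiers): a single triple expresses only a binary relation between its subject and object, while an n-ary relation requires an auxiliary node carrying one triple per participant.

```turtle
@prefix ex: <http://example.org/> .

# A single triple: a binary relation between two entities.
ex:sensor1 ex:measures ex:temperature .

# A ternary relation (sensor, quantity, unit) needs an auxiliary
# node, here a blank node, with one triple per argument.
ex:sensor1 ex:observation [
  ex:quantity ex:temperature ;
  ex:unit     ex:degreeCelsius
] .
```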

Review #2
Anonymous submitted on 21/Apr/2022
Major Revision
Review Comment:

This is a paper that initially had me quite excited, because it appears not just to tackle a significant real-world problem, but also pragmatically uses a wide array of methods to compare the outcomes. However, the whole is not really more than the sum of the parts, since the individual methods are limited in application, separate in experimentation, not really put in relation to each other, and ultimately rather fragmentary in nature. By this I mean that each method is used on a specific part of the specification, tested on that part, and then not related to the others.

Nonetheless, the individual methods are applied on significant subsets of the OPC UA "corpus". In general I have no major problems with the individual methods or their evaluations. However, there are some common issues running across the paper that draw some aspects into question.

(1) Originality

Within the limitations of its own scope, I am willing to believe that this is the first such piece of work. As the last sentence of Section 8 states "this is the first effort to perform this complex information extraction task in the context of OPC UA specifications". However, this is a very narrow scope. This is also reflected in the related work section which only compares Semantic Web related papers. A certain degree of emphasis on this is understandable in the "Semantic Web Journal", but nonetheless I would consider related work to be anything that extracts constraints from textual information model specifications, and not just into Semantic Web formalisms.

The restriction also applies at the other end, since OPC UA is only one method among many that are used for specifying information models in Software or Systems Engineering, and arguably one of the more limited ones. That is not necessarily a fault of the method given its specific purpose and major use as a low level industrial communication specification. (As an aside, I would not consider OPC UA "recent" - it was codified almost two decades ago and is fairly well established.)

Work on constraints and constraint extraction has been going on for decades, with varying assumptions made about the source material; work done over the last decades on similar problems with other formalisms also seems relevant and should be cited and compared.

(2) Significance

Again, as mentioned above, the general problem addressed in this paper - extraction of explicit knowledge from text - is a quite significant one, and the comparison of methods provided here is potentially useful. However, the setting described does not, in my experience, describe the OPC UA environment correctly.

A puzzling statement is the repeated claim (RQ1 in Section 1, later in Section 8) that the constraints in question have to be extracted from the text of the Companion Specification by end users. OPC UA is not just a data model standard; it provides its own modelling language and notation, shown in Figs. 2 and 3, the metamodel for that language, and various commercial tools that operate on models formulated using that notation. To my knowledge the usual process for Companion Specifications is that these are released together with the actual OPC UA model, and in fact the textual description is derived from that information model.

In other words, structural model constraints such as the existence of related entities or specific value restrictions can normally be expressed directly in the OPC UA model, and the textual specifications serve as documentation. So, a machine-interpretable formulation of the constraints that are expressible in the OPC UA notation already exists for a companion specification. Yet in both the introduction and in the second paragraph of the Related Work section the existence of this model is not mentioned, making it sound as if end users have to solely depend on the textual specification document. The description in the related work section of "manual identification of constraints" perhaps only refers to an experiment but does not seem to reflect actual practice.

Now, the expressiveness of the OPC UA model language is limited compared to typical software engineering notations (e.g., UML), but it seems that the types of constraints that are demonstrated in many of the examples (namely, simple structural assumptions, relationship or attribute examples) still fit within its expressiveness and should rather be validated on the actual information model. This seeming contradiction directly affects RQ1 and would need to be cleared up before the paper can be published.

For example, it is not surprising that "43%" of the tables identified deal with object type definitions (Section 5.1.2). That is what the OPC UA data model is most suited for.

A fundamental ambiguity that runs through the paper is the casual equation of semi-formal with formal notation, starting in Section 3: "Rules are formally represented constraints formulated in a semi-formal notation". If the notation is semi-formal, they are not formally represented. (Pseudocode is not code.) This is carried through to later sections where SPARQL and "SFN" are treated as if equivalent. Yet repeatedly in the paper references are made to explicit validation, which seemingly cannot happen (and is not done in the paper's experimental parts) if the rules are written in "SFN". If the rules are not formalised because it takes too much effort for the paper, then this affects the applicability of the method, and the effort and coverage should at least be quantified. If there are expressiveness issues, then these should also be clarified.

(3) Quality of writing

Some parts of the paper were confusing due to terminology not being clearly separated. One aspect is the "formal"/"semi-formal" distinction, another the interaction between "rules" and "constraints". "Constraints" generally refers to model constraints, whereas "rules" refers to executable or implementable production rules, but the two should always be clearly distinguished.

At this point I would not consider the paper to be completely self-contained since it completely omits any presentation of the transformation from OPC UA to Semantic Web formalisms, but knowing these is important to understand the application of the various methods. Yes, the other paper is publicly available (reference [7]), but at least a short summary should be included, and should mention the limitations of the transformation. Even sections of the paper that clearly depend on this transformation do not mention it. E.g., 6.1 talks about "loading the OPC UA information model" into the SPARQL endpoint. But it is the OWL/RDF transformation of that model. This sort of distinction would be central to the clarity of description. (I also note that [7] points out that particular types of constraints captured in the OPC UA model cannot be currently expressed by the transformation. Presumably this also affects the types of constraints that can be captured here?)

Turning to the specific methods, I have few concerns with the methods applied, but I note the obvious significant manual overhead involved in most of them.

Constraint Extraction from tables

The constraint extraction method is based on "manually specified" lists of included or excluded terms. Ultimately this seems to be a highly syntactic method. It would be helpful to know the number of terms used in these lists for *all* the table types studied, not just one example.

It is mentioned that the "rule taxonomy" contains a "constraint taxonomy" but this is not discussed. The interaction would seem to be important.

Constraint extraction from text

The pattern based method seems fairly superficial compared to other methods, but I consider it acceptable given the specific domain; you have to start at some point.

Rule generation

Again, there is not really enough information in this section to understand how the rule generation really works. What different types of information are expressed by the meta-variables in the rule templates? To understand the method, I would expect to see all of them, not just one example. We are told that one particular type of constraint (a relatively straightforward object type definition) is related to 12 rules. How many rules are required for the other types?

"Human-in-the-loop" creation

I cannot see how this is any different from the "manual creation" discussed in the context of the Semantic Validation working group near the end. The use of the Turk approach would seem to merely cloud the degree of competence that the supposed "experts" have, which is never discussed. Given this restriction and the fact that for many cases only one person was available, these results seem pretty useless. I also note that Figs. 16-18 are essentially unreadable. I recommend replacing the screenshots (?) with reformatted versions to make them readable.

In the explanation of Task 2, there is a reference to the "shown rule set". It is not clear to me where that rule set comes from. Who wrote it or how was it extracted?

Evaluation section

The evaluation of the individual techniques cannot be really faulted. The main question that remains however is what one reads between the lines when looking at some of the sections, which is that a lot of manual work is required in all cases (such as the fact that particular keywords are typically limited to a single table type).

Some minor language issues:

"Rule templates are generalized versions of rules that use variables". Normally, rules use variables or they extremely limited in terms of expressiveness. I am assuming that the "variables" mentioned here are meta-variables to be instantiated by

Section 8:

Do not use triples as an example of "n-ary relations". Triples are a ternary relation only as an implementation concept; as a formal concept, triples would be named binary relations.

Minor issues/typos that I noticed:

Section 3.1

"Globale rules"
"an addional rule"

Section 5.1.1



In summary, this is an ok paper but with many restrictions in terms of applicability and with some surprising assumptions on the setting of the method. I would expect these to be thoroughly clarified before the paper can be published.

Review #3
By Gianfranco Modoni submitted on 24/Apr/2022
Major Revision
Review Comment:

This paper focuses on an approach to semantic validation by formally representing the modeling constraints from unstructured documents as explicit rules and (semi-)automatically extracting such rules from PDF documents. The contribution of this research work is new and significant and it has a promising potential, as stated by the authors in the conclusions: “this work has the potential to bring a major contribution towards automatic, semantic validation within the widely used OPC UA standard”, while the current alternative to the proposed approach is a manual collection and formalization of the constraints, “which is a tedious and time consuming activity”.
However, the presentation of some parts of the manuscript should be enhanced. In particular, it is not clear what is still missing in the proposed approach to apply it in production, also considering the various difficulties encountered during its application (e.g., in Step 2 of Stage 2 “the inconsistencies in the appearance of a specific table type in the companion specification documents”, “the currently used software Camelot has some limitations as discussed above for extracting information from tables in textual documents”, etc.). In this regard, the authors should specify (in a concluding section) the challenges still open and the efforts needed to then adopt the complete proposed approach to successfully derive correct validation rules from OPC UA specifications and to use these rules to validate an information model.
Another aspect that the authors should enhance concerns the descriptions of some stages of the approach. In particular, considering Step 2 of Stage 2, it would be useful to finalize and report the whole algorithm used to categorize the tables (in the current manuscript an unclear example of a filter is reported in Fig. 11). In addition, in Step 3 of Stage 2, it would be useful to formalize the algorithm's code to detect constraints and formulate rules. Moreover, reporting the algorithm needed in Stage 3 to automate the generation of rules from tabular data would help readers to understand the proposed approach. The authors should also specify in Section 5.2.1 the software library used to apply the machine learning classification. The authors should also report in Section 7.2 (evaluation of SPARQL rules) the results of the evaluation of the mapping between constraint types and individual rule templates, as required by Step 3 of Stage 2. I also advise the authors to provide the dataset used in the evaluation process and the corresponding evaluation results (e.g., by uploading them to a repository). This information, if available, can better clarify the process conducted to evaluate the approach.
Finally, the paper could benefit by taking into account the following further comments:
-) In Figures 14, 16, 17, and 18 the reported text is not clearly legible; I advise the authors to increase the image size / resolution.
-) “Recall” is mentioned on page 15 (“Fig. 21 illustrates the Recall analysis.”), while its definition is only reported on page 17 with formula (2).
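For reference, the standard definitions in terms of true positives (TP), false positives (FP) and false negatives (FN), which the paper's Eqs. (1) and (2) presumably follow, with F1 as the harmonic mean of the two, are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```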