Change Impact Analysis and Optimization in Ontology-based Content Management Systems

Tracking #: 480-1675

Authors: 
Yalemisew Abgaz
Muhammad Javed
Claus Pahl

Responsible editor: 
Lora Aroyo

Submission type: 
Full Paper
Abstract: 
Ontologies are used to semantically enrich content in content management systems. Ontologies cover a wide range of disciplines enabling machines to understand meanings and reason in different contexts. We use ontologies for semantic annotation to facilitate understandability of the content by humans and machines. We call such systems Ontology-based Content Management Systems (OCMS). OCMS and their ontologies evolve due to changes in the conceptualization, representation or specification of the domain knowledge. These changes are often substantial and frequent in relatively complex systems such as DBpedia. Implementing the changes and adapting the OCMS accordingly requires a considerable effort. This is due to complex impacts of the changes on the ontologies, the content and dependent applications. We approach the problem of evolution by proposing a framework which clearly represents the dependencies of the components of an OCMS. We propose a layered OCMS framework which contains an ontology layer, a content layer and an annotation layer. Further, we present a novel approach for analysing impacts of atomic and composite change operations. The approach uses impact cancellation, impact balancing and impact transformation as mechanisms to analyse impacts of composite change operations. We propose a model which estimates the cost of evolving an OCMS using four criteria. The model ranks available evolution strategies and identifies the best strategy. The approach allows the ontology engineer to evaluate the impacts ex ante and select an optimal strategy during the ontology evolution.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Fouad Zablith submitted on 19/Nov/2013
Suggestion:
Reject
Review Comment:

The authors present in this paper their work on ontology evolution. They propose reusing and extending some of the techniques for ontology evolution from the literature, and implement them in the context of content management systems. The added value is at the level of proposing ontology evolution strategies based on the change impact.

The paper is valuable in the context of presenting the list of semantic and structural change impacts, and change operations. The impact analysis framework is new. The literature review performed is good, but can also be extended as discussed next.

While the idea and the work show research potential to the community, the paper itself can be improved at various levels to make it better fit the journal. In its current state, the paper has weaknesses at the levels of presentation, methodology, and most importantly evaluation process.

At the level of presentation, the authors need to improve the positioning of the paper. First, why are they limiting their work to content management systems (CMS)? Is there any motivation behind this? Doesn’t their work apply to ontologies in general? If the answer is yes, which is probably the case, they are narrowing the exposure of their paper.

In the introduction section of the paper, the choice of examples is misleading, which is affecting how the reader would perceive the work. The authors start by talking about OCMS, but then give an example of DBpedia, and how complex its evolution can be. It would be better to either focus on an OCMS example as stated from the beginning, or to open it up to make the example of DBpedia work.

Another point related to the presentation is how the paper flows. It is hard for the reader, on reaching Section 8 (the validation and illustration), to pick up the data and ontology details presented in Section 3. The example can be brought later in the paper. Also, Section 3.3 is about the change impact analysis framework, and Section 5 talks about it again. Why not merge them?

Throughout the paper, the authors have a tendency to mix up their proposed ideas with what is taken from the literature. For example, the origin of the OCMS concept should be clarified, and whether they are proposing it. I am sure more and more web CMS platforms are integrating ontologies in their publishing process (e.g. Drupal, LOOMP.org etc.). But maybe the authors' focus is more on the internal organization's CMS? This can be fed to the related work. Another example is the use of generic definitions like "Dependency is defined as a reliance of one node… (Section 4.2)", where was this defined? It is advised to either add a reference or make it your own.

Concerning the methodology followed, there are bits presented in the paper but not used in the approach. For example, the sections related to "dependencies" (Sections 4.1, 4.1.1 and 4.1.2) are there, but then they are completely forgotten later in the process. Why are Algorithms 1 and 2 needed? Their purpose is to return the list of dependents, but they are not used later in the approach. Furthermore, the presence of Algorithms 1 and 2 does not add value to the paper as they do what was already described in the FOL syntax within the text.

In the related work section, it is advisable to include references to CMS that use ontologies for annotations. Also there is another angle of change impact analysis that is based on the relevance of changes using for example statistical techniques (e.g. TF-IDF in Cimiano, P., and Volker, J. (2005). Text2Onto - A framework for ontology learning and data-driven change discovery. In Proceedings of the 10th Int. Conference on Applications of Natural Language to Information Systems (NLDB), pp. 15–17.) and others.

At the level of structural impact in Section 5.1.1, you consider the addition and deletion of entities as impact of changes, while they are actually changes. You explained why you made this decision, but this is not a convincing statement as they are still different. Also in Section 6.2, you consider them as changes, rather than structural impacts. This contradicts your initial statement.

Furthermore, there are functions that are used but never defined. One example is on Page 7, where the function directDependetAxioms should be defined; another example is on Page 8 (StrImp and SemImp). Clear definitions of functions should be provided throughout the paper.
Another clarification is needed at the level of cost. Does the Cost of Evolution in Section 6.5 override the costs that were assigned in the previous sections? It is not clear how the system handles these two cost inputs.

The last part of this review is about the evaluation. Authors are advised to redo the evaluation by further taking into consideration the evaluators, data and evaluation methodology. Most of the conclusions are reached without enough background information. For example in Section 5.3.3, what data did you use to have 11.4%, 21.2% and 20.7% reduction of impacts? In Section 7, the scenarios focus on comparing the “cost of evolution” (1) when they are equal, to (2) when they are different. What about the other costs that you defined in your methodology?

The data used is not clear, which makes it hard to relate to what is being evaluated. Maybe a visualization (or link) of the ontology in context can help understand the change operations in Table 8.

You mention that 4 users did the evaluation, 2 experts, 1 general ontology user and 1 novice user. What is the reason behind this selection, and what is the difference between the general ontology user and the novice one? This number of evaluators is not enough, especially if you want to point out findings such as the ones reported in Table 9, or claim at the end that “some of the users pointed out that the presentation of the optimal strategy has affected their response”. Claiming that the change impact analysis performed with 100% accuracy is a strong claim. The evaluation data used here is not substantial to fully support this, making it hard to be generalized. It would be also good to highlight some cases where the proposed approach might fail, and its limitations.

Finally in Section 8.3, you compare your approach to the ontology evolution approaches within the NeOn Toolkit and Protégé. But from Table 10, it looks like you are comparing your approach to the ontology editors' features. This will weaken your analysis, as your comparison should be done against systems that provide similar functionalities.

Review #2
Anonymous submitted on 16/Dec/2013
Suggestion:
Reject
Review Comment:

This paper is unsuitable for presentation in this journal. It has three main drawbacks. The first is that the formal discussion is flawed, full of ambiguities, misuse/abuse of symbols and even clear errors. The second is that the paper is hard to follow, as it uses many terms without defining them, and there are many ambiguities in the informal presentation. The third is that the value of the work seems limited from my point of view.

The reason for the first two drawbacks above will be obvious in my detailed comments below.

For the third one, my justification is mostly related to the first part of the paper (sections 3-5), where, in my opinion, the treatment of change operations in terms of "impacts" is by conception flawed, because it is very difficult to identify all possible "impacts" of a certain change unless coupled with a formal model that identifies them. In addition to that, the interrelationships between changes would make this actually impossible, as there is no way to make a finite set of all the possible changes that could happen in an ontology, let alone identify their impact and means of treatment.

Furthermore, the strategies used are rather naive, and in most cases unsuitable for the "addition" operation.

In fact, this methodology is not novel; it was used in the past by KAON, OntoStudio and similar tools, where again evolution strategies were used. This includes works mostly by Stojanovic and colleagues (but also others) dating back in 2005 or earlier. These works provided a sophisticated analysis of the operators and the definition of the different strategies, and helped towards the realization that a complete listing of operations and their side-effects in an ad-hoc manner is impossible.

Having said that, some of the intuitions of the paper are fine, especially those described in section 6; but even there, the cost model adopted is highly arbitrary.

Detailed comments:

- The beginning of section 3, where the three layers are described, should probably be in the introduction. Up until that point, the setting of the paper was not too clear.

- The dependence is defined between nodes, but then used between nodes and edges as well (in subsection 4.1.2).

- The definition of indirect dependence does not capture transitively all dependencies, which is probably(?) the intention behind the definition.
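To make the point concrete: if the intention behind indirect dependence is indeed the transitive closure of direct dependence, it can be computed with a standard reachability pass over the dependency graph. A minimal sketch (the names and the node-to-set representation are illustrative, not the paper's):

```python
def transitive_dependencies(direct_deps):
    """For each node, compute the set of all nodes it depends on,
    directly or indirectly (i.e., the transitive closure)."""
    graph = {n: set(ds) for n, ds in direct_deps.items()}

    def reach(node, seen):
        # Depth-first walk collecting every reachable dependency once.
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                reach(dep, seen)
        return seen

    return {n: reach(n, set()) for n in graph}

# Example: A depends on B, B depends on C, so A indirectly depends on C.
deps = transitive_dependencies({"A": {"B"}, "B": {"C"}, "C": set()})
# deps["A"] == {"B", "C"}
```

A definition that only chains two direct dependencies would miss longer chains, which is exactly the gap the closure captures.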

- Total dependence should be written as: Dep(N_1,N_2) \land Dep(N_1,N_3) \rightarrow (N_3 = N_2).

- Algorithm 1 seems to miss several types of dependencies, e.g., what about the dependence of a class c with "rdfs:Class" in the triple (c, rdf:type, rdfs:Class)?

- Algorithm 2 is overly complicated; in fact, I was unable to decipher it, given the large amount of undescribed auxiliary functions that it uses. The algorithm could be much simpler; just find all dependencies of all nodes and then determine those N_i for which \{ Dep(N_i,N_j) | N_j \in N \} is singleton.
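The simplification suggested here can be sketched directly: compute the dependency set of every node, then keep the nodes whose set is a singleton. A minimal sketch, assuming dependencies are given as a node-to-set mapping (names are illustrative, not the paper's):

```python
def totally_dependent_nodes(deps):
    """Return the nodes N_i for which {N_j : Dep(N_i, N_j)} is a
    singleton, i.e., nodes that depend on exactly one other node."""
    return {n for n, d in deps.items() if len(d) == 1}

# Example: B depends only on A; C depends on both A and B.
nodes = totally_dependent_nodes({"A": set(), "B": {"A"}, "C": {"A", "B"}})
# nodes == {"B"}
```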

- What is "classAssertion" in subsection 4.1.2? I guess it is the same as the standard "rdf:type" property?

- What's a "restriction edge" (the term appears in subsection 4.1.2)?

- Not all strategies make sense in all types of changes. For example, the "cascade" and "attach-to-..." strategy do not seem to make sense in additions. Also, the strategies are not clearly defined. For example, how does one determine the "affected entities" in the "attach-to-..." strategy? What's a "parent" in the "attach-to-..." strategy? Does this mean "direct superclass"? Or maybe "indirect superclass"? Or maybe "depending entity"?

- Moreover, one strategy does not fit all changes and impacts. Different impacts and different operations should have different treatment. The use of one strategy for each change, no matter how complex and how many different atomic changes it includes, is obviously an oversimplification.

- The formula that "explains" the "attach-to-..." strategy (page 8) is not clear.

- In its original mentioning, in subsection 5.1, impact is totally unclear (it becomes clear later by means of Table 1). But in subsection 5.1, impact is defined as a function that maps an operator and some preconditions to its impact...? So is impact a function after all? Or is it something else that Imp is mapped to? Also, the exact formal definition of what an "atomic change operator" and a "precondition" are is missing.

- In subsection 5.1.1 there is some mentioning on integrity constraints. But there is nowhere in the text any definition on the types of integrity constraints supported, their form, and what constraints in particular are considered in this paper.

- In subsection 5.1.2, N_i, N_j are used sometimes as sets (N_i \subset N_j), sometimes as nodes/URIs (N_i, domainOf, P) and sometimes as numbers (N_i+N_j)!!! Also, E is used as a set (G=(N,E)), and also as an edge (E=(N_i,a,N_j)). To sum up, the entire subsection 5.1.2 is problematic from a formal point of view. Also, in the same subsection, in the definition of IE, it should be noted that there are no "invalid interpretations"; there are interpretations that do not satisfy a specific set of integrity constraints, which is probably what the authors meant to write. But again, what are those integrity constraints?

- According to the informal definition on page 9, all classes that are direct subclasses of "Thing" are orphans as well. Obviously not the intended effect. The formal definition below it is unclear, but would probably result in the same undesired effect.

- What's UC in the "impact" column of Table 1?

- The problem discussed in Subsection 5.3 is not addressed properly. This is a major issue, and a simple set of 4 rules that cover some obvious cases of interrelationships between the impact is not enough. There are several problems with this subsection. First, it contains a lot of repetition. Second, it is totally unclear whether the "Rules" presented will be applicable often enough to be useful. Third, there are many other cases which are not touched. Fourth, the rules themselves are unclear; for example, how can one "change the precondition" (rule 2)? What is a "counter-impact", and how is it defined (rule 3)? In Rule 4, the case of a class having multiple superclasses is not considered. Also, y, y' are classes (not sets) and \subset cannot be applied (unless classes are treated as sets, but this should be explicitly mentioned).

- At the end of section 5, some numbers are given, but without any mention on the dataset (input) that they were obtained from.

- Section 6.3 has two main problems:
1) In the first paragraph, it is mentioned that changing the Abox does not affect the Tbox, whereas a few lines below that the exact opposite is described as "preferable".
2) The last sentence of the first paragraph is unclear, and totally unjustifiable. What does it mean "sensitive to Abox statements", and what does that have to do with the expressivity of the ontological language?

- The analysis in subsection 8.1 is awkward (or I didn't understand it). The authors claim that the strategy selected by the algorithm is indeed optimal. But how is "optimal" defined here? It seems to me that the optimal is the one that has the least cost per the cost model defined; which is obviously what the algorithm will return, as, by definition, it follows the same cost model! Did I miss something?

Review #3
By Tudor Groza submitted on 19/Dec/2013
Suggestion:
Major Revision
Review Comment:

The manuscript presents a comprehensive framework for capturing, analysing and optimising the impact of changes in ontologies - and hence addressing the pressing issue of ontology evolution. In addition to taking into account the basic change operations, the framework models the evolution process in terms of dependencies across conceptual layers, types of impacts, optimal strategies for dealing with the change impact, as well as a method for computing the severity of the impact. The validation of the framework is performed in the context of an ontology-based content management system using a particular change operation and several scenarios for computing the severity of the impact of this operation. Finally, the authors also perform a usability study.

The authors should be commended for the effort put into this work, as well as in the manuscript. Overall, their framework is probably the most comprehensive change impact analysis solution to date, and in addition to its intrinsic value, the manuscript can very well be seen also as a review of the state of the art in the area. On the other hand, there are a few aspects that could be improved or at least addressed by the authors, and the most important ones are: a) the presentation of the manuscript (which sometimes is too abstract and other times too verbose); b) the justification of some of the design decisions - in particular in the context of the severity of the impact and the cost of evolution; c) the evaluation, which is, in my opinion, the weak point of the manuscript - especially since the effort put into the development and description of the framework does not seem to be very well balanced with that put into understanding its capabilities and limitations. Detailed comments are provided below.

1. Introduction

The major flaw of the introduction is the lack of a clear placement of the work in the general context of ontology evolution. The authors start by discussing the use of OCMS, without providing some concrete examples - can you name a few well-known OCMSs? - or drawing a line between applying their framework to ontology evolution in general as opposed to this OCMS context. This separation is extremely important because the rest of the technical description relies on a series of assumptions - such as the presence of the layered architecture introduced in the manuscript, which might not necessarily be the case in a typical ontology authoring / engineering setting. Finally, a more concrete listing of the novel aspects introduced by this manuscript is required - i.e., provide a clear set of contributions in comparison to [12, 13, 14, 15, 16] (here, the authors could include the "literature overview" value of the manuscript).

Some other comments:
* The use of citations throughout the manuscript is fairly confusing. For example, why is [2] provided in the context of the first paragraph ([2] is Gruber's paper on ontologies, while the actual context set by the paragraph is about OCMS)? Similarly, the authors tend to use "coupled" citations in several places - e.g., [5][6] or [7][8] in the 3rd paragraph of the introduction, but also in many other places in the manuscript - which are not justified. Does a general statement such as "a change of one entity may cause many unseen and undesired changes and impacts on dependent entities" require 2 citations? The same applies to the next statement in the same paragraph.
* The use of DBpedia as an example of a large knowledge source is not really justified in that context, because the real issues appear in complex ontologies - i.e., the ones with heavy semantics (e.g., deep classification hierarchies, large number of axioms, multiple inheritance) - and not necessarily in large-volume, yet flat, knowledge bases such as DBpedia. The authors should be able to provide a more convincing example here.
* "Moreover, a given change request can be realised using different evolution strategies" - please provide an example.
* The language could be improved throughout the manuscript - e.g, "This is achieved by embedding semantics by annotating the target content using ontologies" -> "This is achieved by leveraging semantics via ontology-based annotation of the content".

2. Related Work

This is, in principle, fine, although the language could be improved:
* 2nd paragraph: "The work … The work …"
* 3rd paragraph: "The authors [7][20][21] have proposed …" -> The research described in [7][20][21] proposes …
* 3rd paragraph: "Furthermore, the authors give emphasis …" -> "Furthermore, the authors emphasise …"

3. OCMS Principles

Before discussing the actual framework, it is probably worth restating the application or focus of this work. Again, it is not clear if the framework is targeted strictly towards OCMSs - which is quite a limitation in my opinion, since there are not that many of them used in real-world settings - or is generic and hence applicable to any ontology engineering / authoring context.

With respect to the first part of section 3 - most of it is unnecessary, and can be reduced to one paragraph. The authors should have the readers and Journal in mind when writing. For example, the description of the Ontology Layer is, in fact, the general definition of an ontology and it should be described as such and not as a contribution of an ontology layer in the OCMS. Furthermore, taking into account the readership of the journal, this whole section is de facto knowledge and can be left out.

Some other comments:
* In 3.1 it is not clear if the provided example is a running system or a toy example created for illustration purposes only.
* Why was it necessary to introduce so many new ontologies to deal with aspects as simple as a document structure - surely the authors could have used some of the existing ones? - see for example DOCO (http://lov.okfn.org/dataset/lov/details/vocabulary_doco.html)
* Does the DocBook ontology really require 3 citations? [31, 32, 33] ?
* The Help and Software ontologies are underspecified - What are these? Why are they really needed? What are some concrete examples of their use? Are they known ontologies or they were created for the purpose of this exercise?
* "The domain ontology is also known as the application ontology" -> fairly strange statement - could the authors rephrase it?
* The description of this application ontology is also underspecified
* The graph notation and the example described in 3.2 introduce a series of arguable aspects. For example, the use of properties as both nodes and edges is confusing - especially later on in the manuscript when change operations are applied on them. Furthermore, Figure 3 contains, besides some typos (e.g., instaneOf -> instanceOf), the rdfs:instanceOf property, which is undefined. Could the authors clarify this aspect?
* Please add a reference to the set definitions on the following page in the context of the edge definition (i.e., that large union in the Ontology Graph paragraph)
* "The edges are referred as triples" -> not quite clear what the authors try to state here (the Annotation Graph paragraph on page 5)
* The Attributes of the Graph description mixes formal definitions with programmatic methods - it would be better if the authors would be consistent and use only formal definitions. Moreover, this paragraph is not really needed.

4. Change Request Capturing and Representation

* Firstly, the title could be improved - e.g., "Capturing and Representing Change requests"
* The authors could improve the readability of the first paragraph by adding a couple of examples. Also, the multi-citation issue is present here again - [5][39] and [40][41] - are they really necessary? can the authors provide a better context for these citations?
* In 4.1, in the last sentence - "we identify dependencies that are useful for implementing …" - how exactly is this identification performed?
* 4.1.1 could be significantly reduced. Both algorithms presented here are unnecessary - they are very basic ontology / graph operations that bring no added value to the manuscript and take a lot of space. The authors should probably limit themselves here to providing a simple schematic definition for the three types of dependencies.
* In the definition of the partial dependency - since N1 is defined as partially dependent on N2 Pdep(N1, N2) -> why is N2 then mentioned in the existential quantifier?
* In 4.1.2: "Using an empirical study …" -> could the authors provide additional details on this study? A better justification of the list of chosen dependencies is required.
* Concept-Concept dependency - this is where the use of Properties as both nodes and edges becomes confusing. The textual definition of this dependency mentions "concept nodes", while the formal definition restricts it to classes. In order to enable a clearer and easier way to follow the various definitions, the authors could perhaps simplify the graph notation, or start the description of the framework with a terminology subsection that clarifies the use of certain terms, such as 'concept'. Furthermore, a short discussion on the effect of using other properties as foundation for this type of dependency would be interesting - e.g., in the biomedical domain (in particular contexts) "part-Of" is an equally important relation - how do the authors see this being integrated into their list of dependencies?
* In 4.2 in Cascade Strategy: an example is required to help understand this strategy, specifically in the context of the statement: "In case of addition, when we add an entity, we need to add all other entities that make the new entity semantically and structurally meaningful."
* In 4.2 in Attach-to-Parent / Root Strategy: what does this strategy look like for something other than "Delete"? Can the authors provide an example? ("… link all affected entities to the parent entity …")

5. Change impact analysis process

This section is fairly well structured and written and it provides a comprehensive overview of all the aspects required to be captured in the process of analysing change impact. There are only a few things that could be improved:
* some of the types of impacts could have better names: e.g., "entity more / less described" reads slightly strange. Similarly, "entity incomparable"
* in 5.2 the reference to Table 2 should probably be a reference to Table 1
* Algorithm 3 is again not needed
* In 5.3 an example of a composite change would improve readability

6. Optimal strategy selection

While the aspects described so far in the manuscript are to a large extent derived from previous work, in my opinion the computation of the severity of impacts and the cost of evolution are the most important contributions of the work - unfortunately, however, also the ones that were treated most shallowly. It is understandable that the authors went for the most straightforward options when creating the computational framework, but then a clear justification is required, in addition to a discussion of how it would look if one wanted to go beyond the simple linear aggregation of the impact weights or strategy costs.

For example, in 6.1: "… we use heuristics to measure the severity … such as tolerance, […] amount of time and expertise required …" - all these elements are important, yet hard to quantify, and thus, it is quite disappointing to see everything reduced then to a 'heuristically' chosen value of 0.6 or 0.8. The authors should try to do a much better job in justifying the choice of values if a proper framework for computing them is not provided. This is particularly important, because the cost of evolution and especially the validation process depends on these values.

7. Validation

The validation description is, unfortunately, the weakest aspect of the manuscript because it provides a very limited view over the full capabilities of the framework. Assuming that the example provided in the manuscript is enough to exhibit the application of the framework (although it would have been good to provide at least another operation - for comparison purposes), a detailed and quantitative (not only empirical) analysis of the choice of weights is crucial. Otherwise, there's really no difference between what the authors provided and a random choice. Furthermore, the authors motivate their work, in the beginning of the manuscript, via the need of a framework that is able to handle large and complex ontologies. However, the validation is performed on a small dataset - which is unrealistic for a proper, real-world context. Hence, unless the authors expand their validation to a much larger scale, they should at least discuss the behaviour of the framework in such a setting.

8. Evaluation

The usability study detailed in this section is similar, in setting, to the framework validation - i.e., low number of participants, underspecified study details, shallow results analysis. The authors should at least provide the complete study design, including aspects such as the time allocated for learning the tools or details on previous experience with ontology engineering tools and the OCMS framework. This should be then complemented with a detailed discussion of the results. For example:
* a score of 3.3 in the last question is not so positive - what is the reason behind / interpretation of this result?
* "The cost estimation is suitable to measure impacts" = 4.0 - in my opinion, this is a very surprising result considering that the participants should have a deep understanding of the weight assignment scheme and of the actual impact of the change operation, which is quite hard to believe that a general ontology user and a novice user may have.
Finally, 8.3 should be expanded with a more detailed discussion of the state of the art tools.