A content-focused method for re-engineering thesauri into semantically adequate ontologies using OWL

Tracking #: 771-1981

Daniel Kless
Ludger Jansen
Simon Milton

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Full Paper
The re-engineering of vocabularies into ontologies can save considerable time in the development of ontologies. Current methods that guide the re-engineering of thesauri into ontologies often convert vocabularies only syntactically and ignore the problems that stem from interpreting vocabularies as statements of truth (ontologies). Current re-engineering methods also do not make use of the semantic capabilities of formal languages like OWL in order to detect logical mistakes and to improve vocabularies. In this paper, we introduce a content-focused method for building domain-specific ontologies based on a thesaurus, a popular type of vocabulary. The method results in a semantically adequate ontology that not only contains a semantically rich description of the entities to be modeled, but also enables non-trivial consistency checks and classifications based on automated reasoning, and can be integrated with other ontologies following the same development principles. The identification of membership conditions, the alignment to a top-level ontology and formal relations, and the consistency check and inference using a reasoner are the central steps in our method. We explain the motivation and sub-activities for each of these steps and illustrate their application through a case study in the domain of agricultural fertilizers based on the AGROVOC Thesaurus. Foremost, our method shows that simple syntactic conversions are insufficient to derive an ontology from a thesaurus. Instead, considerable structural changes are required to derive an ontology that corresponds to the reality it represents. Our method relies on a manual development effort and is particularly useful where a highly reliable is-a hierarchy is crucial.

Major Revision

Solicited Reviews:
Review #1
By Stefano Borgo submitted on 30/Sep/2014
Major Revision
Review Comment:

This paper proposes a methodology to build a formal ontology from a thesaurus. The thesaurus is seen as a structured resource that can be harvested to reduce the amount of time needed to build an ontology from scratch. The methodology is generally well organized, mostly motivated and well described; some examples are given by applying it to an existing thesaurus.
The paper is well structured, the presentation is fairly clear, the English is good, the topic is important and within the journal's scope.
This reviewer finds the approach and most of the results reasonable and interesting with helpful intuitions and examples. However, the authors make some choices that remain undiscussed and at times unclear. Some aspects seem to be inconsistent and need to be better presented/explained. My comments, following the text, are collected below. Some of them are general and impact the soundness of the approach; since I believe these can be quite problematic given the general framework, I request major revisions.

General observation: each step/action not exemplified in the application (for various reasons) needs to be exemplified in some other way. This is not the case now as I will point out in the comments.

From the introduction: “Controlled vocabularies…often incorrectly referred to as terminologies”, please explain and motivate.

“It is only ontologies using OWL that give hope for integrating independently developed ontologies in a logically consistent way and with correspondence to the represented reality.”
1) generally OWL is preferred to other logics (in particular FOL) because of its computational properties but this is not taken as a motivation for your work, so the choice to use OWL cannot be just logical consistency (or formal semantics) and is not really justified;
2) the goal of a thesaurus is (often if not always) not to represent reality but to represent how terminology is organized. In particular, the focus is on the conceptual perspective. This is recognized in the paper in many places, e.g. when saying that “such vocabularies […] contain several thousand up to hundreds of thousands of concepts”. One strong consequence of the realist position, at least as presented by the Buffalo school which is here correctly cited, bans even the use of the term ‘concept’. This shows a need for an explanation: what can we say to claim that the adoption of the realist viewpoint (wrt a more cognitive -or at least neutral- approach) is correct in this context? As far as I can get from the paper, the argument is that “there are authors […] particularly emphasizing the stance of ontological realism.” This is too weak as a motivation. It seems to me that the paper is mixing two issues: building an ontology from a thesaurus and building an ontology that follows a philosophical setting. It is unclear why these two issues are mixed up.
3) It would be interesting and important to know if and where the structure of a realist ontology can make a difference in the methodology (wrt a more cognitive/neutral ontological approach). If so, I suggest you emphasize this as a valuable result of your research and identify it with examples.

You write “Further, our method aims at developing a semantically adequate ontology[…]” and list: full use of the semantic expressivity of OWL, integration with other ontologies, consistency and reasoning results. These are good points but do not suffice to turn an arbitrary system into a semantically adequate ontology. Since the expression is relative to a goal (to be adequate “for” something), here you should remind the reader what the goal of the sought ontology is.

On the methodology: “The reason why we focus on the reengineering of thesauri is that there are structural differences between different types of vocabularies (e.g. simple lists of terms, thesauri, taxonomies or classification schemes [28]) and their reengineering may differ.”
This is an important observation and I fully agree. However, your methodology does not apply to ISO thesauri only since it starts with a check of the correctness of the thesaurus. Thus, being a thesaurus according to the ISO standard is not necessary. You now must make explicit what are the constraints a generic resource must obey to be used by the described methodology.

From Figure 1 it seems that one could work through all the steps up to position 7 and then be forced back to position 1. This is awkward and casts doubt on the robustness of the methodology. Can you divide the 7 steps into blocks so that, once you move from one block to another, you know that the steps from the previous block won’t be repeated? Is this possible at all?

The name of step 4 should be improved since it sounds weird to “align” relations.

“Such relationships would be considered erroneous in term-based thesauri and should be “transferred” to concept-to-concept relationships, just like the relationships between preferred terms.”
The closing of this sentence is not clear. Do you mean “and a similar change should be applied to the relationships between preferred terms”?

Sect. 3.2, pg.6 “While, in principle, a choice between formal languages can be made, we focus on the popular OWL in its 2nd version”
Part of the content in point (a) does not belong here but in the application example; the methodology should say which language to choose depending on some conditions, or set minimal constraints on the languages one can use.
This section is a bit of a mix between specific and general considerations. I suggest focusing more on the general description of the step and the principles guiding it and, within this, adding pointers to the following application section for specific examples.

“Since modeling relationships between relationships is not subject of thesaurus work, there will be no use of the object subproperty axioms to assert generic relationships.”
I’m not sure I understand your point, could you explain/expand this sentence?

In the application section of pg. 8 you say “It turned out to be not useful to follow the actions described for this step in the case of the AGROVOC thesaurus.” This means that the methodology is not general and that you should expand it at least to include your own case.
I was expecting also examples of problems with the hierarchical relationships and of the mentioned workarounds towards the end of the application section but none is given (you just say that you take the is_a relationship as in AGROVOC). Actually, it is unclear whether the proposed separation between a syntactic step (step 2) and a more semantic oriented step (step 3) is correct since you give semantic arguments discussing concepts and relationships at step 2 also. There is something unsettled in this part of your methodology: while the distinction is fine in general terms, work does not seem to split in this way in practice. What can you say about this?

“It is also desirable to identify necessary and (jointly) sufficient membership conditions that define a class, because it is only defined classes under which other classes can be subsumed by automated reasoning.” Something is missing in the sentence.
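The distinction the quoted sentence relies on can be illustrated with a minimal sketch. It is not taken from the paper: the class names and condition strings are hypothetical, and subsumption is reduced to plain set inclusion, which is far weaker than what an OWL reasoner does. It only shows why a class with merely necessary conditions cannot attract inferred subclasses, while a defined class (necessary and jointly sufficient conditions) can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntClass:
    """Toy class description; all names here are hypothetical."""
    name: str
    conditions: frozenset   # necessary membership conditions
    defined: bool = False   # True if the conditions are also jointly sufficient


def inferred_subsumes(parent: OntClass, child: OntClass) -> bool:
    """Toy stand-in for a reasoner: child is inferred to fall under parent
    only if parent is a defined class and child meets all its conditions."""
    return parent.defined and parent.conditions <= child.conditions


nutrition = "has_disposition some plant_nutrition"  # assumed condition label

fertilizer_primitive = OntClass("fertilizer", frozenset({nutrition}))
fertilizer_defined = OntClass("fertilizer", frozenset({nutrition}), defined=True)
urea_fertilizer = OntClass(
    "urea_fertilizer", frozenset({nutrition, "has_part some urea"})
)

# Only the defined variant lets the toy reasoner classify urea_fertilizer.
print(inferred_subsumes(fertilizer_primitive, urea_fertilizer))
print(inferred_subsumes(fertilizer_defined, urea_fertilizer))
```

With necessary conditions only, the subclass link would have to be asserted by hand; with a defined class it can be inferred.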

Pg. 9 “Sometimes the terms have even multiple meanings in a single community, particularly if there are different schools of thought. In such cases, an ontology may need to contain several classes for a given term, each for every meaning.”
As said at the beginning, I’m confused about the actual goal: are we ontologizing a thesaurus or using a thesaurus to build a (realist) ontology? If the second, then you need to add an initial step to evaluate whether the thesaurus is suitable, i.e., if it can be taken to address reality (whatever one takes that to mean). In such a case, your methodology does not apply to any system classified as a thesaurus by the given ISO. If you don’t start with this initial step, you enter an unclear loop where you combine two orthogonal goals: ontologizing a structure and changing its content. I’m afraid that requires much more attention on how and when to change the thesaurus content.
For example, at this point we already have identified a concept for each meaning of the terms (according to the thesaurus). Assume in another community, perhaps more scientifically or reality oriented, there are more distinctions for the same term, is this something to consider? how is this taken into account in the methodology? what if there are incompatibilities between your system and the other? This opens a series of problems…

Pg 11 “Since our approach is a scientific one, but also because the AGROVOC thesaurus did not provide any disambiguating hint, we used the reference to carbon to characterize the class ‘organic fertilizer’”
I believe that if this issue were spotted at step 1 you would have introduced two distinct concepts, one for each meaning. Why don’t you do the same here? Later you say “For example, we had to choose between different interpretations of ‘organic fertilizers’ and to decide what to count as ‘plant micronutrient’.” These comments, to be useful, must be combined with criteria or at least suggestions for deciding when one has to choose a meaning out of many and when one has to extend the number of concepts/classes!

Figure 5 is not very informative; it can be dropped. What is relevant (the elimination of some terms) is already explained in the text.
Instead, the relationship between these classes and those of dispositions is quite important and should be addressed in detail (this issue also comes up below).

“as well as to their state after they have been applied to the field and bound or solubilized plant nutrients.” Something missing in the sentence.

Regarding “guidelines for the choice between top-level ontologies”, perhaps you can look at the work of M. Keet from University of Cape Town, South Africa. Although from the foundational viewpoint I don’t consider her approach to be optimal, for the goals of this methodology it could be of help (just a suggestion).

The index of footnote 1 is not correctly displayed (in the text) at pg. 14

“This may be said a weakness of Protégé,” -> “This may be considered a weakness of Protégé,”

“In many cases, the urge to introduce new relationships is due to an insufficient ontological analysis.”
Agreed, very important point. Here an example is needed (since not in the following application description) otherwise the observation would not be understood.

An example is needed after the general claim: “The generic relationships in a thesaurus are prima facie candidates for becoming is-a relationships in an ontology. Since they may be mixed with hierarchical whole-part relationships in a thesaurus, organizing thesaurus concepts into an is-a hierarchy may imply re-combining fragments of the thesaurus that are not related by properly applied generic relations. This, in turn, may require introducing new classes to connect these fragments.”

“We must declare ‘fertilizer’ to be such compound since there is hardly any pure fertilizer material in real-life environments.”
Yet, you might want to talk about the substance in isolation (and it can be isolated): the choice of classifying it as a compound must be explained/motivated further.

“to which classes the newly introduced classes” (drop the first occurrence of “classes”)

“It would have been a tremendous advantage, if ChEBI had been more mature in terms of the membership conditions specified for its classes. It would have saved us tremendous time and spared us to deal with amendments of ChEBI.”
This comment is not useful per se, you should rephrase it into a suggestion to prefer mature and well specified domain ontologies (but how to judge them?)

“It was not absolutely clear to us, if we should model a fertilizer disposition, a fertilizer function or even a fertilizer role. These distinctions need better clarification and guidance. This problem also applies to BFO [51], [52].”
This is a crucial point for which there is little guidance, so your experience is important: you should report on your choices and the motivations, even if they are just examples and you cannot give general principles.

“The formal specification of classes is realized by adding the necessary membership conditions identified in step 3 as anonymous superclasses using the subclass axiom. …” This part is too specialized, you can move it to the application section.

“and possibly description logics [in] general.”

The following claim relative to class A is puzzling when one can use a relationship as well as its inverse: “any relationship from a class A to another class B has always the role and logical force of a necessary membership condition for the class A”.
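One way to make the question concrete is a small model check. This is only a sketch under a stated assumption: that the relationship is read as an existential restriction (every A is related by r to some B), which is a common OWL reading but is not confirmed by the quoted sentence. The individuals and the relation below are invented for illustration.

```python
# Hypothetical toy model: two classes and one relation between individuals.
A = {"a1", "a2"}
B = {"b1", "b2"}
r = {("a1", "b1"), ("a2", "b1")}            # r holds from members of A to b1
r_inverse = {(y, x) for (x, y) in r}        # the inverse relation


def every_member_has_some(domain, relation, target):
    """Check the existential reading: each member of `domain` is related
    by `relation` to at least one member of `target`."""
    return all(any((x, y) in relation for y in target) for x in domain)


# Necessary condition on A: every A has r to some B.
a_side = every_member_has_some(A, r, B)
# The mirrored condition on B via the inverse fails, because b2 is unrelated.
b_side = every_member_has_some(B, r_inverse, A)

print(a_side, b_side)
```

Under this reading the condition constrains A without constraining B, so stating the relationship through its inverse would have to be asserted separately as a condition on B.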

There is an interesting paper at the WOP 2014 workshop by Elena Cardillo et al. which discusses the same topic of this paper with a similar approach. Please, include it in the literature review and state what differentiates the two approaches.

Explain the “**” (double star) in Appendix 2, step 3.
Step 4, point a: are these one or two distinct choices? If they are distinct, can you give an example? Why is there no specification about what to do with the chosen formal relations?

Review #2
By Eetu Mäkelä submitted on 01/Oct/2014
Major Revision
Review Comment:

I very much like the core of the paper, which contains a well-thought, well-sourced and well-discussed method for transforming a thesaurus into an expressive OWL ontology. However, I also have numerous problems with the paper.

First, at times the text isn't sufficiently thought out, finalized, focused, concise, or clear. This makes the article a chore to read. This concerns particularly the introduction chapter, the start of which could really use a native language editor, but applies also to varying extents to other parts of the article (particularly section 3.5). Going through the whole text with a thick comb, pruning out unnecessary side threads and in general clearing up the exposition would do a lot of good.

Frankly, I think even cutting some whole sections could be warranted to make the core argument more concise and clear. For example, the discussion on choosing ontology editors in section 3.5 didn't seem to add any value to the main argument, and felt more like it was done to be able to cite a few more papers. I encourage the authors to think on their citing habits also in general. Especially in going outside the main thrust of the article at hand, probably just one main reference would be enough instead of the current one to four.

As regards organization of the paper, I think the presentation of ABoxes and TBoxes in general should belong in the introduction instead of the related work section. I am also not completely satisfied with how the method is laid out in terms of a single process of steps following each other (even though it is stated that these have to be applied iteratively). In my opinion, while some steps clearly act as prerequisites of others, this does not hold through the whole process, and the description of the method would do better to just more clearly lay out these dependencies and the possible and necessary interactions between the tasks (with the understanding that when such dependencies are not specified, the two steps can be completed in any order).

These mostly presentational details aside, my chief concern content-wise is how the method is justified, contrasted and aligned to others in its field. Two related issues here are the amount of hand-waving and impreciseness, as well as seeing more contrast than there actually is.

Starting with impreciseness, the introduction mingles together RDF semantics with RDFS semantics, even though they are separate. It also lumps all of RDFS in with "worst case RDFS" when comparing to OWL and also falsely contrasts OWL with the previous in matters such as XML-like syntaxes and support for IRIs and URIs. The whole discussion on OWL vs RDF also doesn't make any sense to me, when I think the issue is actually between OWL-inconsistent and consistent ontologies.

This same problem also concerns the discussion contrasting "ABox re-engineering" with "TBox re-engineering". The ABox and TBox distinction only makes sense with regard to OWL semantics, and not with regard to e.g. SKOS semantics. For this reason, SKOS semantics also cannot be categorized and judged wholly from the OWL viewpoint. A much fairer treatment would be to discuss the actual intent in other methods, as well as the inference capabilities offered by them - for example, a lot of the referenced other methods DO put an effort into creating a valid is-a hierarchy for the ontology. This is the issue that should be compared, and not if that hierarchy is directly actionable by OWL-reasoners (or if you have to have a SKOS reasoner to do it). Doing it this way would also much better align the method described in this article, in a more fine-grained manner, to the others referenced.

Finally, it really isn't fair to first state that you think "semantically adequate"=(fully OWL-based) ontologies are superior to others when you then can't back that up in discussion. In section 4 you explicitly state that "The overall benefits of a semantically adequate ontology as opposed to a thesaurus need to be subject of further investigations." A particularly interesting omission here is the fact that in section 3.5 it is said that the OWL reasoning enabled by the described method "revealed various initial modeling mistakes". Yet, the paper doesn't go into the details of these at all, even when doing so could lend credence to the argument that the arduous process of membership condition discovery and encoding could be worthwhile.

In summary, I really think there is an article here worth putting out, but it would benefit tremendously from both a tightening up, as well as from a more precise, yet also neutral and conversational attitude to its environment.

Review #3
By Leo Obrst submitted on 07/Oct/2014
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper is very nearly ready to be accepted for publication. I will highlight below a small number of minor issues which should be addressed in a slight revision of the paper.

The paper is focused on the re-engineering of a thesaurus to be a sound OWL ontology. As such, it provides a very good and detailed scheme/methodology for transforming a thesaural term or lexical resource with hierarchic structure (a controlled vocabulary or so-called knowledge organization system, or quasi-conceptual classificational system) into an ontology, i.e., a conceptual/real-world categorical resource.

It is, however, much more than that: it represents a sound and principled methodology for developing ontologies, especially but not exclusively OWL ontologies, and as such is a major contribution to the Semantic Web and ontological engineering communities. Because the focus is so strongly on refining content, and the issues involved in deciding what constitutes a good ontology (though initially from the point of view of transforming a thesaurus), it represents also a contribution to the subject area of ontological analysis, and as such would be in fact a welcome contribution to a purely content-focused journal such as Applied Ontology, in addition to being very appropriate for the Semantic Web Journal.

The paper is well-written and well-structured, and painstakingly explains the authors’ T-box re-engineering method, applying it as a case-study to a subset of the AGROVOC Thesaurus, that of the sub-domain of fertilizers. Even more detail is elaborated in a number of appendices. Finally, the paper situates and compares its approach against other proposals in the literature, highlighting the distinctions between T-Box and A-Box methods, and potentially errant views of those.

At 35 pages, the paper is rather long, on the order of a book chapter in length, but in this case the length seems well-motivated, permitting the authors to delineate the steps of their approach, provide clear examples in the chosen sub-domain at each step, and so assists the reader in clearly understanding the issues and motivations.

I have only a few, relatively cosmetic suggestions that will make the paper better, as follows, and assume that the authors can make these rather quickly:

p. 3: “understanding our method will help understanding its difference” => “understanding our method will help in understanding its difference”

p. 6: “as a relatum that we would have to consider as structural relationship” => “as a relatum that we would have to consider as a structural relationship”

p. 7, Figure 4: there seems to be a missing arrow or even 2 arrows, one from the “hierarchical part-of-relationship” box and one from the “associative relationship” box, going into the currently orphaned “object subproperty” box, i.e., additional “corresponds to” arrows.

p. 12: “we introduced a class ‘plant_nutrient_disposition’, comprising all instances” => “we introduced a class ‘plant_nutrient_uptake_disposition’, comprising all instances”. Perhaps a copy error.

p. 12: “Except rejection of some classes …” is not a sentence. Perhaps attach it as an extension of the previous sentence or use a different transitional conjunction.

p. 14: “or the upper levels of CyC [55]–[57]1.” => “or the upper levels of CyC [55]–[57]” [ADD FOOTNOTE, i.e., a superscript numeral 1].

p. 13-14, Section 3.4.1 discussion: Contrary to the given statement, there are a couple of potential references that compare or suggest bridging upper/top-level ontologies:

Baumgartner, Norbert; Werner Retschitzegger. 2006. A Survey of Upper Ontologies for Situation Awareness. Proceedings of the International Conference on Knowledge Sharing and Collaborative Engineering (KSCE '06), St. Thomas, USA, November, 2006. http://www.bioinf.jku.at/publications/ifs/2006/1306.pdf.

Mascardi, Viviana; Valentina Cordì; Paolo Rosso. 2006. A Comparison of Upper Ontologies. Technical Report DISI-TR-06-21. http://www.disi.unige.it/person/MascardiV/Download/DISI-TR-06-21.pdf. See also: http://www.ontology4.us/Ontologies/Upper-Ontologies/Comparison/index.html, which is based on the above paper.

Mascardi, Viviana, Angela Locoro, Paolo Rosso. 2010. Automatic Ontology Matching Via Upper Ontologies: A Systematic Evaluation. IEEE Transactions on Knowledge & Data Engineering 2010 vol.22 Issue No.05 – May, pp. 609-623. http://www.computer.org/csdl/trans/tk/2010/05/ttk2010050609-abs.html.

Obrst, Leo; Patrick Cassidy; Steve Ray; Barry Smith; Dagobert Soergel; Matthew West; Peter Yim. 2006. The 2006 Upper Ontology Summit Joint Communiqué. Journal of Applied Formal Ontology. Volume 1: 2, pp. 203 - 211, 2006.

Obrst, Leo. 2010. Ontological Architectures. Chapter 2, pp. 27-66 in the book: TAO – Theory and Applications of Ontology: Computer Applications, Roberto Poli, Johanna Seibt, Achilles Kameas, eds. September, 2010, Springer.

Penicina, Ludmila. 2013. Choosing a BPMN 2.0 Compatible Upper Ontology. eKNOW 2013 : The Fifth International Conference on Information, Process, and Knowledge Management. http://www.thinkmind.org/download.php?articleid=eknow_2013_5_30_60120.

Semy, S.; Pulvermacher, M.; L. Obrst. 2005. Toward the Use of an Upper Ontology for U.S. Government and U.S. Military Domains: An Evaluation. MITRE Technical Report, MTR 04B0000063, November 2005. http://www.mitre.org/work/tech_papers/tech_papers_05/04_1175/index.html.

p. 14: “(i.e., in the Protégé lingo, all necessary object properties).” => should be in the OWL lingo, since object properties are notions from OWL, not Protégé.

p. 14: “We must declare ‘fertilizer’ to be such compound” => “We must declare ‘fertilizer’ to be such a compound”

p. 20: “using OWL and possibly description logics general.” => “using OWL and possibly description logics in general.”

p. 21: “Nevertheless, the integration of synonymous” may be ok, if the focus is on the property. Otherwise => “Nevertheless, the integration of synonyms”.

p. 24-25: Some citations are in bold font here: why? Examples: Hahn [7], etc.

p. 28:
[67] E. Beisswanger, S. Schulz, H. Stenzhorn, and U. Hahn, ‘BioTop: An upper domain ontology for the life sciencesA description of its current structure, contents and interfaces to OBO ontologies’, Appl. Ontol., vol. 3, no. 4, pp. 205–212, 2008.
=> [67] E. Beisswanger, S. Schulz, H. Stenzhorn, and U. Hahn, ‘BioTop: An upper domain ontology for the life sciences - A description of its current structure, contents and interfaces to OBO ontologies’, Appl. Ontol., vol. 3, no. 4, pp. 205–212, 2008.

p. 30: The format is off, for the last sentence.