Transition of Legacy Systems to Semantic Enabled Application: TAO Method and Tools

Hai H. Wang, Danica Damljanovic, Terry Payne, Nick Gibbins, and Kalina Bontcheva
Despite high expectations, the industrial take-up of Semantic Web technologies in developing services and applications has been slower than expected. One of the main reasons is that many legacy systems have been developed without considering the potential of the Web in integrating services and sharing resources. Without a systematic methodology and proper tool support, the migration from legacy systems to Semantic Web Service-based systems can be a very tedious and expensive process, which carries a definite risk of failure. There is an urgent need to provide strategies that allow the migration of legacy systems to Semantic Web Services platforms, and also tools to support such strategies. In this paper we propose a methodology and its tool support for transitioning these applications to Semantic Web Services, which allows users to migrate their applications to a Semantic Web Services platform automatically or semi-automatically. The transition of the GATE system is used as a case study.
Submission type: Tool/System Report
Responsible editor: Krzysztof Janowicz

Review 1 by Patrick Maué

OK, the licensing section is good, but it contains some typos (highlighted in the text).

The TAO project partners are currently maintaining and will further develop these software components. ** LANINO ** (LATINO, I guess) will be developed further by one of the TAO project partners, the Jozef Stefan Institute (JSI), as part of several ongoing EU projects. KCIT and the other related components have been integrated into GATE, ** and** are developed further as part of the GATE development process. The latest version of GATE 6.0 was released in November 2010 and a new release is planned for May 2011. HKS is further developed as part of **the** OWLIM Semantic Repository (OWLIM) by Ontotext.

Besides that, I recommend accepting it.

The following reviews address a previous version of the manuscript.

Review 1 by Patrick Maué

My comments have all been considered in the new version. The paper has improved, and in general I suggest accepting it. There is just one remaining (and in my opinion critical) issue: This paper has been submitted for the special issue on "Semantic Web Tools and Systems". It reports on the TAO Suite, a set of tools developed within the research project TAO. The TAO Suite is available as a download from the project website. But it is not clear under which license this software has been developed, and I doubt there have been any improvements since the project funding ended. Have a look at the discussion about the journal issue here (

From my experience, there is usually no infrastructure for (and no financial interest in) keeping project websites sustainable (including all the downloads advertised only through these websites). I expect that a special issue about tools only reports on software which is available online and is actively maintained. I would therefore suggest that the authors move the presented TAO suite to an open source project hosting platform, and set up the infrastructure which supports maintenance and community building. This includes access to the source code, issue tracking for feedback, and some training material illustrating how to set up and extend the software. The existing website should be updated with links to the open source project. The authors should respond with their strategy for how the domain will be managed (and whether it will still be available in, let's say, ten years). Or they update all their links in the paper to a persistent URL which directly points to the open source project (which is far more likely to be around in ten years).

Review 2 by Todd Pehle

I particularly like the fact that the tool is used as a hybrid semantic/IR-type tool. Having a wide range of search tools in the tool belt is definitely beneficial and blends functionality well for users of a system.

These reviews are for the first submission and led to a 'reject and resubmit'; the current PDF version contains the resubmission.

Review 1 by Patrick Maué

*General Thoughts*
This is a well-written paper, summarizing the outcomes of the TAO project. TAO provides a platform to migrate legacy software into the Semantic Web, by semi-automatically creating ontologies from the source code and additional documentation, and annotating (or augmenting, as the authors say) the service descriptions with links to these ontologies. One paper alone is not much to report on all the results delivered in the TAO project. But the authors tried, and refer to a project deliverable whenever it really starts (at least for me) to get interesting. This is basically my main issue with this paper. In some parts it rushes through technical details. It is then really hard to follow; the authors expect too much from the reader here. Altogether, an interesting paper to read, and I can only suggest that everyone have a look at the TAO project.

Figures 3-5 can't be read once printed out; please provide a version with higher resolution. The list at the end of section 3.1 spans across the two columns.

The title "Transition of Legacy Systems to Semantic Enabled Application" already contains two mistakes:
(1) it is either "semantically enabled" or "semantic-enabled" (compound adjectives should be hyphenated if the first doesn't end with -y)
(check here:); this happens several times in the paper
(2) Application should be plural (since systems is)

The first sentence also contains one mistake (depending on what you wanted to say). It is either
(1) Semantic Web and Semantic Web Service "technologies" have been ... (add "technologies" or an alternative)
(2) _The_ Semantic Web and Semantic Web Service_s_ have been ....

Section 1
the industrial take-up .. have been -> has been
into semantically web-enabled <- is that "semantic- and web-enabled environments" or "Semantic Web"-Enabled environments?
which facility the migration -> facilitates
this paper reports a new -> reports on a new

Section 2
Rephrase the first sentence.. make at least three out of it ;)
"contant augment tool" -> in the figure it is "augmentation", stick to it
transition.. to semantic enhanced -> semantic-enhanced
applications of semantic enrichment of software -> for semantic enrichment
TAO methodology -> again, start with "The"
comparing with -> compared with
limited to texts data sources -> text-based, textual or just text data sources
structure documents -> structured documents (or are the documents meant to structure something else?)
TAO Suite includes ... , has been developed -> remove "has been developed"
is identifying -> is to identify
Web services interfaces are usually developed in WSDL documents -> Web service_ interfaces are usually described through/documented as/... WSDL documents (you develop _in_ Java)
SA-WSDL is one of the lastest W3C recommendation -> "is the latest recommendation" or "is one of the latest W3C recommendations" (the latter doesn't make sense though)

Section 3
rely on experts' options and common sense -> opinions?
follow-on -> follow-up

I am not a native speaker myself (but some of the co-authors are; please correct me if I am wrong with my suggestions): try to make sure that you use articles (The, A) more often, and check that you have the plurals right.

*Detailed comments for the individual sections*

"exhibiting huge commercial potential"
do you have a citation for that?

_Methodology cookbook_
"The ontology learning tool is used to derive a domain ontology from legacy application documentations"
In my opinion you can't derive "domain" knowledge from application-specific ontologies. What you get here are application ontologies, which you can later either align to domain ontologies, or map to other ontologies either on the same level (application-specific) or any other. Have a look at Ontology Architectures (e.g. the paper by Guarino)

"[..] these types are typically datatypes based on XML Schema, rather than richer knowledge-based types [..]"
Datatypes always follow a certain underlying model (i.e. graph-based, relational, object-based, ...); the schema (i.e. XML, RDF(S)) is an explicit encoding of this model (to enable interoperability). This concept of "knowledge-based types" doesn't really make sense: do you mean data entities which have been derived from knowledge (aka ontologies)? But then they still have to have some kind of encoding which follows a schema, right? (and let's not start with the question of when explicit knowledge models (such as an XML schema or an OWL file) can be called "ontologies" ;) )

"GATE developers and users find that it becomes difficult to understand..."
Do you have a reference for that, or is that your personal experience?

"Foundational ontologies are often highly abstract and constraining. "
They are not. Ontologies are either highly abstract OR they are constraining (these two concepts are mutually exclusive). Foundational Ontologies like DOLCE are certainly not constraining, that would defeat their purpose.

"They are almost never adapted to business requirements"
That's because they are "foundational", they are not adapted for any particular domain. Don't confuse foundational- with domain-ontologies, and domain- with application-ontologies.

".. it is a heterogenous knowledge store designed and ..."
What kind of technology do you use here, something like a content repository (e.g. compliant with the JCR standard)?

"Determine the structure between instances"
A bit difficult to understand; a few examples would have been good. What are the text mining instances you are talking about? A clear explanation of this concept is missing. Are the data entities the instances? Why should we then later assign text-based documents to these instances?

"... the user needs to study the data at hand ... .... is is important to include only those bits of text that are relevant and will not mislead the text-mining algorithm"
I see a lot of potential for error here. Let's take the following example: I have a service implementing an algorithm for modelling some physical phenomena, which was originally developed by some biologist. Will I be able to use his paper about the algorithm (containing all the key concepts required to understand this algorithm) for the ontology building process, or will I end up with a biology ontology (since in this paper an example from biology has been used to illustrate the algorithm)?
Why the focus on the actual source code? Having well-documented source code which could be used for text-mining is usually an exception.

"Create domain ontologies from feature vectors"
Does Ontogen expect feature vectors, or can I feed it with raw texts as well?

"this figure depicts three concepts"
I see three clusters of terms which are closely related, and three regions around these clusters telling me that these are apparently three concepts. What I don't see are the labels "Nominal Coreferencer", ... as you mention it in the text. I know that OntoGen can indicate what concepts exist by identifying clusters, but I don't think it can come up with the correct (?) labels for these regions.

".. the automatically acquried knowledge is post-edited using an external ontology editor ..."
Now I am interested in the evaluation, I hope you compare what is more time-consuming, creating an ontology from scratch by just looking at the service interface, or performing the semi-"automatic" approach

2.1.3 Service and content augmentation
"KCIT does ... semantic annotation .. and persistent storage and lookup of augmented content"
Later you explain that you use the PROTON ontology, which means (I think) that you store the annotations separately from the actual content. What exactly do you do? This is quite crucial, since you have to synchronize if you store it redundantly (and the source changes), or you have to keep your links up-to-date if you store them separately (think of database normalization). You should explain your "keeping annotations synchronized" approach.

"Identify services and other content to be annotated"
This is now a bit confusing: we start with the actual source code to create our ontologies, and now we have to "find" the service which is the implementation of our source code to link it to the ontology?

"we assume that the Web services ... for a legacy application have been developed"
Isn't that the whole point of a legacy system? Something out there which has been and will be running for years? Something where the original creators have no intention to migrate it into the Semantic Web since it worked the last decade without it? If you are in the position to actually create Web services and Web service descriptions, you could integrate semantics from the start (without the need to later augment it).

"KCIT identifies key concepts intelligently (more than exact text match ... using NLP techniques"
It would be good if you could (a) provide an example which benefits from your NLP approach and (b) tell us in what sense this is "intelligent" (is there any reasoning involved, or are these mainly string manipulation techniques like stemming)?
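
To make the distinction in (b) concrete, here is a minimal sketch contrasting exact text match with a stemmed match. It is not KCIT's implementation: the suffix-stripping rule, the function names, and the sample sentence are all hypothetical and far cruder than a real stemmer.

```python
def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (hypothetical rule, not a real stemmer)."""
    token = token.lower()
    for suffix in ("ers", "er", "ing", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def exact_match(concept: str, text: str) -> bool:
    """True only if the concept label occurs as a whole token (case-insensitive)."""
    return concept.lower() in text.lower().split()


def stemmed_match(concept: str, text: str) -> bool:
    """True if any token shares the concept label's (naive) stem."""
    target = naive_stem(concept)
    return any(naive_stem(tok) == target for tok in text.split())


# Hypothetical concept label and sentence, for illustration only.
text = "The pipeline runs two tokenisers before tagging"
exact_hit = exact_match("Tokeniser", text)
stemmed_hit = stemmed_match("Tokeniser", text)
```

With this input the exact match misses the plural form while the stemmed match finds it. That is pure string manipulation with no reasoning involved, which is why the question of what "intelligently" means matters.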

"Storing and Querying Annotations"
So you store the exact position of the annotation in the WSDL document for your annotation? What happens if the underlying service changes (e.g. a new operation is added, the metadata is updated)?
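
The concern can be sketched as follows, assuming annotations are stored as character offsets into the document. That storage model is my assumption, and the `Annotation` class, the concept label, and the WSDL-like snippets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int    # character offset where the annotated span begins
    end: int      # character offset just past the annotated span
    concept: str  # hypothetical ontology concept label


def annotated_text(doc: str, ann: Annotation) -> str:
    """Return the span of the document the annotation points at."""
    return doc[ann.start:ann.end]


# Hypothetical, simplified service description fragment.
original = '<operation name="tokenise"/>'
begin = original.index("tokenise")
ann = Annotation(start=begin, end=begin + len("tokenise"), concept="Tokeniser")

# The service changes: a new operation is added before the annotated one.
changed = '<operation name="parse"/>' + original

valid_before = annotated_text(original, ann) == "tokenise"
valid_after = annotated_text(changed, ann) == "tokenise"
```

After the change the stored offsets point at unrelated text, so the annotation silently refers to the wrong element unless the offsets are recomputed or the annotation is anchored to something stable, such as the operation name.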

3 Evaluation
The evaluation is focussing on some of the components of the TAO infrastructure. That's good, but I am missing an actual evaluation of the problem you address in this paper. What (and where) are the GATE Web services you have annotated, and are there any benefits (e.g. findability, not of forum discussions (which is great) but of the annotated services)?

"we asked the GATE developers to refine it in order to create a golden standard"
This is the WRONG direction. You first create the golden standard from scratch, and then try to reach it with your software. Your "golden standard" is automatically biased, since you already gave the developers a template to start with. And then (after the refinement) you ask them for feedback on the quality, which of course is "encouraging".
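
A minimal sketch of the direction I have in mind: the gold standard is built independently, and the learned ontology is then scored against it. The concept labels below are invented for illustration and are not taken from the paper's evaluation.

```python
def precision_recall(learned: set[str], gold: set[str]) -> tuple[float, float]:
    """Score learned concepts against an independently created gold standard."""
    overlap = learned & gold
    precision = len(overlap) / len(learned) if learned else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard, created from scratch by domain experts.
gold = {"Tokeniser", "Gazetteer", "Coreferencer", "Annotation"}
# Hypothetical output of the ontology learning step.
learned = {"Tokeniser", "Gazetteer", "Plugin"}

precision, recall = precision_recall(learned, gold)
```

If the gold standard is instead derived by refining the learned ontology, the overlap is inflated by construction and both figures lose their meaning.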

"61.1% of the questions were answerable"
This sounds great, but I have no idea what you mean by "answerable". How can a domain ontology answer a question? What types of questions were you looking at ("Where is ..?", "How can I ...?", "Help me...?")?

Review 2 by Todd Pehle

The authors present the TAO (Transition Applications to Ontologies) methodology and tool set developed to ease the burden of transitioning legacy software applications towards semantically enabled services. The TAO methodology consists of: 1) knowledge acquisition (of APIs, source code, forums, documentation, etc.) 2) domain ontology learning and 3) service and content augmentation. The tool suite offers seemingly mature software tools to implement this methodology. The suite consists of a heterogeneous knowledge store for knowledge acquisition, LATINO and OntoGen for ontology learning and a Key Concept Identification Tool (KCIT) for service and content augmentation.

The following review guidelines for tools and systems were assessed:

Quality of the Tool:

The TAO tools were not downloaded and assessed during my review. Therefore I cannot judge the overall quality of the tool. However, the tools did appear to be open source and available for download. The TAO website also listed many demonstrations, documentation and several videos which is a great bonus. From a qualitative viewpoint, it does appear the tool is fairly mature and has been utilized in several different industry projects as mentioned in the paper. References to the tool go back several years. Overall, all tools were open source. However, I could not confirm how currently active each sub-project was during the review.

Importance and Impact of the Tool:

TAO could be useful along several axes. Manual ontology construction over a domain or set of data sources has been a long standing issue in the knowledge representation community. Using automated and semi-automated techniques to learn the knowledge structures of a constrained domain could greatly aid knowledge engineers. TAO could also be useful in aiding the myriad web services, Web APIs and APIs in general to become semantically enabled.

Regarding impact, my main question is how much differentiation does the tool provide over traditional functionality offered via a web service operation or traditional IR techniques? In industry I'm often faced with questions like "Can't I do this just with Java?" and "Isn't using Lucene good enough?". The evaluations conducted in the paper suggest that 61.1% of the questions were answerable. Since a lot of the data sources had textual content (even within the semi-structured sources like source code), would a keyword index be able to retrieve all 100% of the questions, albeit with perhaps a bit more work? For the 46.61% increase in time to use the legacy system, is this enough utility to get a group to spend the time creating the domain ontology? I hope yes. I guess it would have been nice to see a deeper dive into the example ontology: which axioms were constructed automatically, how long it took to construct, as well as what the limitations are of both the tools and the methodology. Sometimes an issue with a tool can outweigh any gains. What is the required expertise level to use LATINO, OntoGen and KCIT?

A metric I thought may have been missing was how much delta in user functionality was achieved due to semantic annotation of web services. If I understand correctly, from the learned domain ontology, TAO can enhance WSDLs with SA-WSDL annotations via GATE. I'd be curious to see how learned ontology concepts from LATINO and OntoGen performed in a semantic matchmaker or brokering service using SA-WSDL, OWL-S or WSMO. Granted, for brevity's sake, it would be hard to include detailed explanations.

Clarity, Illustration, Readability of the paper:

The paper had good presentation. It also had several helpful illustrations and diagrams to help explain the methodology and tool. My main confusion point in the paper was over the term 'Semantic Web Services'. When I first read this I automatically thought the tool focused on enabling OWL-S and WSMO services. However, the paper largely focused on using the acquired domain ontology to search for terms "about" the GATE software API ("find forums about NLP") vs. performing a semantic query over GATE as a service ("find Persons in this document"). In the end, I believe TAO can help enable both capabilities, but it was slightly confusing initially.

Minor Comments:

- I liked the term "application mining". The term constrains the context to that of API-type software which seems to be TAO's primary focus.

- The term 'knowledge acquisition' in this context may mean 'data acquisition'; knowledge acquisition may sometimes refer to the extraction or population of an ontology in some contexts.

- In Section 3.1 the "Efficiency" bold headings overlap the right-hand column. Perhaps this is just something wrong with my PDF version, but it may be worth confirming.

Overall, the tool seems promising and useful. I'd like to understand the limitations better, though. It'd also be great to see the tool applied to "SEMANTIC WEB services" (Linked Data services) in addition to "semantic WEB SERVICES" (OWL-S, WSMO, etc.).