A linguistic method for extracting production rules from scientific texts: Evaluation in the specialty of gynecology

Tracking #: 1782-2995

Authors: 
Amina Boufrida
Zizette Boufaida

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper
Abstract: 
Due to the considerable increase in freely available data (especially on the Web), extracting relevant information from textual content is a critical challenge. Most of the available data is embedded in unstructured texts and is not linked to formalized knowledge structures such as ontologies or rules. A potential solution to this problem is to acquire such knowledge through natural language processing (NLP) tools and text mining techniques. Prior work has focused on the automatic extraction of ontologies from texts, but the acquired knowledge is generally limited to simple hierarchies of terms. This paper presents a polyvalent framework for acquiring complex relationships from texts and coding these in the form of rules. Our approach begins with existing domain knowledge represented as an OWL ontology and applies NLP tools and text matching techniques to deduce different atoms, such as classes and properties, to capture deductive knowledge in the form of new rules. We evaluated our approach by applying it in the medical field, specifically, the specialty of gynecology, showing that our approach can automatically and accurately generate SWRL rules for the representation of the more formal knowledge that is necessary for reasoning.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 08/Jan/2018
Suggestion:
Reject
Review Comment:

The paper describes a method to derive formal rules (in SWRL format) from unstructured textual inputs. The authors use a combination of linguistic techniques and tools to identify the body and the head of the rules, associate them to classes in an ontology, classify them and construct the rule triples that define SWRL rules.

I have several concerns on the paper:
-I found the review of the state of the art outdated and partial. First, the authors argue that no works (other than the one by Hassanpour et al.) have tackled the problem of deriving non-taxonomic rules from text. They justify this by citing “current works” (from 2001-2003!) that focused only on extracting taxonomic relationships, while only some “recent works” (2007-2009) attempted to derive logical relationships. I find this hardly convincing, since much of the work on knowledge acquisition conducted in the last 10 years has focused on non-taxonomic relationships (including ontological properties and attributes, OWL restrictions and axioms), such as:

Learning non-taxonomic relationships from web documents for domain ontology construction. Data Knowl. Eng. 64(3): 600-623 (2008)

A methodology to learn ontological attributes from the Web. Data Knowl. Eng. 69(6): 573-597 (2010)

Learning expressive ontologies. In Conference on ontology learning and population (pp. 45–69). Germany.

Acquisition of OWL DL axioms from lexical resources. In 4th European conference on the Semantic Web (pp. 670–685).

Fostering Web intelligence by semi-automatic OWL ontology refinement. In 7th International conference on Web intelligence (pp. 454–460).

Learning relation axioms from text: An automatic Web-based approach. Expert Syst. Appl. 39(5): 5792-5805 (2012)

-The authors follow a fairly standard approach to derive the rules, which includes the use of several available techniques and tools. As such, it sounds more like a use case than a truly innovative knowledge acquisition methodology. In this sense, I would suggest that the authors remove or rearrange the background details they give on ontologies, ontology languages and linguistic tools (Sections 1.1 and 1.2, and much of the content of Section 3.1).

-The approach is purely linguistic, which implies that it receives a single textual input and outputs rules that should hold, because they are meant to be integrated within a domain ontology. However, many issues would make such rules inappropriate, such as language ambiguity, noise or just incorrect or decontextualized statements. To mitigate these problems, most knowledge acquisition and ontology learning methodologies (such as those referenced above) employ a statistical assessor, which benefits from the huge amount of electronic text currently available on the Web, to filter out all but the most robust statements.

-Even though the authors' approach uses general-purpose tools, it is strongly biased toward a specific domain of knowledge (which they employ as use case and as evaluation benchmark). Since this is a medical domain and they use high-quality textual inputs, this eases the problem of automatic knowledge acquisition, but also hampers the generality of the approach. How hard would it be for practitioners to apply the method to other domains of knowledge? In this sense, the authors' work seems related to works deriving formal medical knowledge from clinical practice guidelines, which may be worth looking at.

-The evaluation is hardly convincing, since it employs a very small data set and gives no evidence of objectivity (I assume that the authors themselves manually rated each output of the system). Moreover, no comparisons with related works are included.

Review #2
Anonymous submitted on 28/Feb/2018
Suggestion:
Reject
Review Comment:

SHORT SUMMARY
-------------

The paper provides an architecture for extracting SWRL rules from text, and a description of the implementation of this architecture specific for the case of gynaecology. As far as I understood from the paper the steps foreseen in the architecture are:
1) The text is extracted from relevant sources and manually decomposed into relevant textual sentences.
2) An existing NLP pipeline/tool is exploited to tokenize the sentence and assign each component their relevant grammatical category and lemma;
3) An original matching algorithm is exploited to match the relevant tokens (roughly nouns and verbs) into an existing manually built ad hoc ontology, thus producing a list of relevant ontology terms T = <t_1, ..., t_n>, where each t_i is either a concept, an individual or an object property.
4) An original algorithm is proposed to convert all possible pairs obtained by computing the Cartesian product T x T into RDF triples, where the missing entity in the triple is retrieved by exploiting the relation existing between t_i and t_j in the ontology (ignoring is_a and instance_of relations).
5) An original rule (pattern) is adopted to convert the triples obtained into SWRL rules.
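
To make the steps above concrete, here is a minimal Python sketch of steps 3 to 5 as I understand them. It is not the authors' implementation: the toy lexicon, the property table, all term names, and in particular the conversion pattern in to_swrl (last triple becomes the head, everything else the body) are assumptions made purely for illustration, and the resulting rule is not clinical knowledge.

```python
from itertools import product

# Hypothetical toy ontology (all names are illustrative assumptions).
# LEXICON maps a lemma to an ontology term; LINKS maps an ordered pair of
# terms to the object property relating them (is_a / instance_of excluded).
LEXICON = {"patient": "Patient", "fibroid": "Fibroid", "myomectomy": "Myomectomy"}
LINKS = {
    ("Patient", "Fibroid"): "hasDiagnosis",
    ("Patient", "Myomectomy"): "undergoes",
}

def match_terms(lemmas):
    """Step 3: keep only lemmas that match an ontology term (order kept, no duplicates)."""
    terms = []
    for lemma in lemmas:
        term = LEXICON.get(lemma.lower())
        if term is not None and term not in terms:
            terms.append(term)
    return terms

def build_triples(terms):
    """Step 4: for every ordered pair in T x T, look up the linking property."""
    triples = []
    for subj, obj in product(terms, repeat=2):
        prop = LINKS.get((subj, obj))
        if prop is not None:
            triples.append((subj, prop, obj))
    return triples

def to_swrl(triples):
    """Step 5 (assumed pattern): the last triple is the head, the rest the body."""
    variables = {}

    def var(term):
        # One variable per ontology term, numbered in order of first use.
        return variables.setdefault(term, f"?x{len(variables)}")

    *body_triples, (hs, hp, ho) = triples
    body = []
    for s, _, o in triples:
        for cls in (s, o):
            atom = f"{cls}({var(cls)})"
            if atom not in body:
                body.append(atom)
    body += [f"{p}({var(s)}, {var(o)})" for s, p, o in body_triples]
    head = f"{hp}({var(hs)}, {var(ho)})"
    return " ^ ".join(body) + " -> " + head

lemmas = ["patient", "have", "fibroid", "undergo", "myomectomy"]
terms = match_terms(lemmas)
triples = build_triples(terms)
print(to_swrl(triples))
```

Note that the unmatched lemmas ("have", "undergo") are silently dropped in step 3, so the output can only contain atoms that the ontology already provides.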

GENERAL COMMENT
---------------

The paper tackles an interesting and original problem, namely the extraction of SWRL rules from text. The problem is definitely relevant to the semantic web and knowledge engineering community, and as such it is definitely relevant for the SWJ. As far as I know this is still an open problem and - differently from ontology population or ontology axiom learning - not much work has been devoted to this.
Nonetheless, I have several concerns on (1) the generality of the proposed work beyond the specific case illustrated here, (2) some of the choices made to build the architecture, and (3) its evaluation, which definitely hamper - in my view - the significance of the proposed work.

Moreover, the paper would need to be substantially rewritten to add explanations, comments, and details.

SPECIFIC COMMENTS
-----------------

GENERALITY/SIGNIFICANCE
-----------------------
The authors claim that the paper presents a (general) linguistic method for extracting production rules from scientific texts and an evaluation in the speciality of gynaecology. Unfortunately, what the paper describes, besides a very high-level pipeline, is only the specific implementation of the system in the speciality of gynaecology. A discussion of the scope, boundaries, and requirements for adapting or exploiting the system in a different scenario is completely missing, as is an investigation of the generality of the approach.
To be a scientific contribution as a research paper, the paper should describe:
- the characteristics of the type of text required by the system, with convincing evidence that these are characteristics commonly found in scientific text (note, I'm not saying they are not. Just that they should be provided in the paper).
- the characteristics of the pre-processed text. That is, what is the specific input accepted by the system.
- the role and importance of the ontology. This is, to me, one of the critical points of your solution, when it comes to its generality and applicability. What are the characteristics of your ontology? It seems it has to cover exactly the concepts (roles, and individuals) described in the text for the pipeline to work. I find this a *huge* requirement which should be carefully discussed. How much does your system depend upon a well crafted ontology which covers exactly (modulo synonyms) the terms in the text? If this is the case, what are the impacts on adopting your solution in a different context?
The reason why this point bothers me is that usually the extraction from text happens exactly because people do not have knowledge already formalised and one could argue that if an expressive ontology needs to be built from the text from which rules are extracted then also the rules can be manually extracted together with the ontology.
Is using pre-existing (e.g., medical) ontologies for the task a realistic scenario? How would this affect the performance of your system? As an alternative, are there methods of ontology construction that would build a reasonable ontology from your input language so that the pipeline can be further automated?
These are, to me, fundamental questions that would need to be discussed and quantified in proposing the approach. Just mentioning that an ontology is required to have the system working is too vague.
- My last point on the generality of the approach concerns the claim of *extraction from text*. This is somehow related to the above and the role of the ontology. From reading the paper, it seems to me that you are considering only terms (for instance, nouns) that are matched to something in the ontology. Indeed, if a noun appeared in the text but was not matched to an ontology term, it would be discarded. This, to me, is not extraction of rules from text: it is selecting, among all the possible rules that can already be built starting from the ontology (and your pattern), the ones that actually occur in the text. A more general approach would be, to me, to build a set of RDF triples starting from the text, exploiting some of the tools that already exist (see e.g., FRED - http://www.istc.cnr.it/it/news/fred-and-tipalo-natural-language-rdfowl, or PIKES - https://knowledgestore2.fbk.eu/pikes-demo/ , or even ReVerb - http://reverb.cs.washington.edu), and then use your idea of converting RDF triples to SWRL.
How would this compare to your approach in terms of performance and generality?
If you strongly believe that an ontology-based approach is the best (and my guess is that it should pay off), you should really investigate your hypothesis and find out what the advantages are of doing things as you propose, wrt a more general pipeline which does not make use of the ontology.

EVALUATION
----------
I found the evaluation rather narrow, in different ways.
- 12 cases (I assume sentences) are definitely too few for current evaluation standards. No description of these sentences, their variability and their relation (coverage) wrt the ontology is provided. This makes the numbers reported in the evaluation not particularly significant to me.
Also, the fact that the system was able to deal with basically all sentences apart from datatype properties and that 75 out of 82 terms were relevant (I assume that by relevant you mean present in the ontology) seems to indicate a kind of overfitting wrt the knowledge already embedded in the system.
I believe that a more realistic evaluation should be planned to prove the claim that the system performs well.
Also, as already mentioned, an evaluation wrt a baseline that produces SWRL rules from RDF triples directly extracted from text would be interesting.
Also, would it be viable to compare with previous work such as that of Hassanpour et al. (2014)?
Indeed, using your system with the datasets of Hassanpour et al. (2014) - if available - and comparing the results could be an interesting experiment to prove that you are better than the current baseline.
Finally, a description of the test set (pairs text, rule) should be provided to the reviewers together with a description of the process of creating it.
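
To illustrate the kind of reporting I have in mind, the following Python sketch scores system output against a gold test set of rules. The rule identifiers and the gold set are hypothetical placeholders; only the 75-out-of-82 figure comes from the paper's evaluation as summarised above.

```python
def precision_recall(predicted, gold):
    """Compare the set of generated rules with a manually built gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical rule identifiers, for illustration only.
system_rules = {"r1", "r2", "r3"}
gold_rules = {"r1", "r3", "r4", "r5"}
precision, recall = precision_recall(system_rules, gold_rules)

# Share of extracted terms judged relevant, using the figures quoted above.
term_relevance = 75 / 82
print(round(precision, 3), round(recall, 3), round(term_relevance, 3))
```

Reporting recall against an independently built gold set, and not only the share of relevant terms, would show how much of the knowledge in the text the ontology-driven matching misses.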

RELATED WORK
------------
The paper claims that extraction of ontological knowledge is mainly focused on is_a relations. While this is partially true, there is a substantial body of work devoted to the extraction of expressive axioms that should be mentioned. See, e.g., the work of Johanna VÖLKER et al.:

@book{Volker:2009:LEO:1696519,
author = {Volker, J.},
title = {Learning Expressive Ontologies: Volume 2 Studies on the Semantic Web},
year = {2009},
isbn = {1607500337, 9781607500339},
publisher = {IOS Press},
address = {Amsterdam, The Netherlands},
}

@inproceedings{Volker:2007:AOD:1419662.1419722,
author = {V\"{o}lker, Johanna and Hitzler, Pascal and Cimiano, Philipp},
title = {Acquisition of OWL DL Axioms from Lexical Resources},
booktitle = {Proceedings of the 4th European Conference on The Semantic Web: Research and Applications},
series = {ESWC '07},
year = {2007},
isbn = {978-3-540-72666-1},
location = {Innsbruck, Austria},
pages = {670--685},
numpages = {16},
url = {http://dx.doi.org/10.1007/978-3-540-72667-8_47},
doi = {10.1007/978-3-540-72667-8_47},
acmid = {1419722},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
}

or further work from the same author and more recent attempts such as the ones in

@inproceedings{DBLP:conf/ekaw/PetrucciGR16,
Author = {Giulio Petrucci and Chiara Ghidini and Marco Rospocher},
Bibsource = {dblp computer science bibliography, http://dblp.org},
Biburl = {http://dblp.uni-trier.de/rec/bib/conf/ekaw/PetrucciGR16},
Booktitle = {Knowledge Engineering and Knowledge Management - 20th International Conference, {EKAW} 2016, Bologna, Italy, November 19-23, 2016, Proceedings},
Crossref = {DBLP:conf/ekaw/2016},
Date-Added = {2017-10-06 10:30:14 +0000},
Date-Modified = {2017-10-06 10:30:14 +0000},
Doi = {10.1007/978-3-319-49004-5_31},
Pages = {480--495},
Timestamp = {Tue, 23 May 2017 01:06:49 +0200},
Title = {Ontology Learning in the Deep},
Url = {https://doi.org/10.1007/978-3-319-49004-5_31},
Year = {2016},
Bdsk-Url-1 = {https://doi.org/10.1007/978-3-319-49004-5_31},
Bdsk-Url-2 = {http://dx.doi.org/10.1007/978-3-319-49004-5_31}}

In particular I believe that the work of VÖLKER et al should be compared with the approach presented in this paper as the rule to go from RDF to SWRL could be seen as a kind of pattern.

REPRODUCIBILITY
---------------
None of the initial text, the ontology, the test set, or the tools is available.

MINOR COMMENTS
--------------

- I suggest moving the comparative illustration of related work with your system after the presentation of your system.
- page 9, column 1. A reference to ANNIE should be added.
- page 11, end of column 1. I found the text starting from "The objective of this rule is to annotate," not fully comprehensible. Where are the lines the text refers to?
- tables, figures and algorithms should be explained in the text. In particular the triple construction algorithm and all the figures in section 4.

