Querying Biomedical Linked Data with Natural Language Questions

Tracking #: 1216-2428

Thierry Hamon
Natalia Grabar
Fleur Mougin

Responsible editor: 
Guest Editors Question Answering Linked Data

Submission type: 
Full Paper
Recent and intensive research in the biomedical area enabled to accumulate and disseminate biomedical knowledge through various knowledge bases increasingly available on the Web. The exploitation of this knowledge requires to create links between these bases and to use them jointly. Linked Data, SPARQL language and interfaces in natural language question answering provide interesting solutions for querying such knowledge bases. However, while using biomedical Linked Data is crucial, life-science researchers may have difficulties using the SPARQL language. Interfaces based on natural language question answering are recognized to be suitable for querying knowledge bases. In this paper, we propose a method for translating natural language questions into SPARQL queries. We use Natural Language Processing tools, semantic resources and RDF triple description. We designed a four-step method which allows to linguistically and semantically annotate questions, to perform an abstraction of these questions, then to build a representation of the SPARQL queries, and finally to generate the queries. The method is designed on 50 questions over three biomedical knowledge bases used in the task 2 of the QALD-4 challenge framework and evaluated on 27 new questions. It achieves good performance with 0.78 F-measure on the test set. The method for translating questions into SPARQL queries is implemented as a Perl module and is available at http://search.cpan.org/~thhamon/RDF-NLP-SPARQLQuery/.

Minor Revision

Solicited Reviews:
Review #1
By Mariana Neves submitted on 13/Nov/2015
Review Comment:

I have no further remarks.

Review #2
By Anca Marginean submitted on 10/Dec/2015
Minor Revision
Review Comment:

In my opinion, the paper still needs some improvements.

Some issues raised in the previous reviews still require attention. One example is the subsections of Section 2: I think that all SPARQL queries are built with the goal of querying linked data, so the separation of the related work into subsections 2.1 and 2.2 is unjustified. Another example is the description of the contextual rewriting rules.

Minor issues:
Page 2: "...,while semantic entities refer to the entities provided the semantic resources" - it should be "provided by"?
Page 3: "Moreover, [1] aims at translating natural language questions into SPARQL queries" - I think that "moreover" is not necessary here, since all the papers previously presented in that section are about the same subject.
Page 4: "Thus, in comparison with the closest work [32], our method makes closer links between NLP and Linked Data." - what does "closer links" mean?
Page 4: "detection of named entities [14]", where [14] is a work on Entity Boundary Detection for Korean Text based on SVM. As I understand it, in the current paper the term Named entity recognition (Fig. 2) denotes the identification of numbers and solubilities with regular expressions, which is quite different from [14].
Page 5: Fig. 2 mentions Sentence segmentation. Since the input of the system is a query, is it necessary? And if so, is it done after Word segmentation?

Review #3
By Jin-Dong Kim submitted on 12/Feb/2016
Review Comment:

After reading through the revised manuscript, I have found that the authors have addressed my previous comments reasonably well. I have no further comments, and I recommend that it be published in the special issue.

Review #4
By Christina Unger submitted on 24/Feb/2016
Minor Revision
Review Comment:

I'm still struggling with your distinction between 2.1 and 2.2. I agree that things get more complex when, e.g., different KBs need to be queried and there has to be some additional federation step. But this is not the case for the papers you cite ([22] and [32]) -- they translate a question into a SPARQL query, and the step of querying the KB simply consists of sending the constructed query to the endpoint. This could equally easily be added to the systems you mention in 2.1, so I don't see any principled difference. Work where this distinction does make sense, I think, is PowerAqua and RADAR (a paper in this issue: http://www.semantic-web-journal.net/content/radar-information-reconcilia...), where reconciliation of information or answers is required.

Besides that, just some remaining typos:

Throughout the paper:

* You sometimes write "linked data" and sometimes "Linked Data". Please keep this consistent.
* You very often use the expression "associate together" as a transitive verb; I think this is ungrammatical. Using "join" instead would work better in most cases.


* Linked Data, SPARQL language and --> Linked Data, the SPARQL language and
* RDF triple description --> RDF triple descriptions

Page 2:

* remaining of this work --> remainder of this paper
* You have three sentences with brackets, where I would not use brackets, because the information is important:
- three ways are identified for querying Linked Data (knowledge-based specific interface, Graphical Query Builder and question answering system)
- the types of Linked Data which are processed (typically, general [5,16,32] or specialized [1] KBs)
- a multilingual toolkit (called Grammatical Framework)
* string similarity computing --> string similarity computation

Page 3:

* recently, the Question Answering over Linked Data (QALD-4) challenge proposes --> recently, the Question Answering over Linked Data (QALD-4) challenge proposed
* querying of linked data with the natural language interfaces --> querying of linked data with natural language interfaces
* Footnote 3: I would put the name of the task in the main text, as it's important information.

Page 7:

* we also extract terms, which usually correspond to noun phrases relevant for the targeted domain, from --> we also extract terms which usually correspond to noun phrases relevant for the target domain from
* when its context contains semantic entity --> when its context contains a semantic entity
* and to adjust (modification or deletion) semantic types --> and to adjust (modify or delete) semantic types

Page 8:

* Additionally [...] was also used --> Additionally [...] was used
* In some places you capitalize "result" in "result form", "question" in "question topic", and "predicate" or "argument"; I'm not sure why.
* applied on the object --> applied to the object

Page 10:

* can be URI --> can be URIs

Page 11:

* represented as regular expression --> represented as a regular expression

Page 13:

* Definition of the Semantic Resources --> Description of the Semantic Resources
* diseases and genes linked among them by --> diseases and genes linked by


* In [10], there's a problem with the name of the second author.