Bio-SODA - A Question Answering System for Domain Knowledge Graphs

Tracking #: 2601-3815

Authors: 
Ana Claudia Sima
Tarcisio Mendes de Farias
Maria Anisimova
Christophe Dessimoz
Marc Robinson-Rechavi
Erich Zbinden
Kurt Stockinger

Responsible editor: 
GQ Zhang

Submission type: 
Full Paper

Abstract: 
The problem of question answering over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are specifically targeted at open-domain question answering, and often cannot be applied directly in complex closed-domain settings of scientific datasets. In this paper, we focus on the specific challenges of question answering over closed-domain knowledge graphs and derive design goals for KGQA systems in this context. Moreover, we introduce our prototype implementation, Bio-SODA, a question answering system that does not require training data in the form of question-answer pairs for generating SPARQL queries over closed-domain KGs. Bio-SODA uses a generic graph-based approach for translating questions to a ranked list of candidate queries. Furthermore, we use a novel ranking algorithm that includes node centrality as a measure of relevance for candidate matches in relation to a user question. Our experiments with real-world datasets across several domains, including the last official closed-domain Question Answering over Linked Data (QALD) challenge – the QALD4 biomedical task – show that Bio-SODA outperforms generic KGQA systems available for testing in a closed-domain setting by increasing the F1-score by at least 20% across all datasets tested. We also provide a new bioinformatics benchmark with complex queries drafted in collaboration with domain experts. The experimental results show that for these types of real-world queries, the advantage of Bio-SODA is even more significant, outperforming state-of-the-art systems by up to 46% in F1-score.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 29/Dec/2020
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
---------------------------------------
The paper studies the task of answering natural language questions (QA) over single-domain knowledge graphs (KGs), as opposed to the majority of QA research, which has targeted open-domain KGs like DBpedia or Wikidata. The presented system Bio-SODA (1) requires no question-answer pairs for training and relies solely on data from the KG in question, and is thus suitable for domains where training data or knowledge sources are absent or rare; (2) works for KGs integrating multiple datasets and handles multi-hop questions with many triple patterns in a query; and (3) adapts incrementally when the KG is updated or new datasets are added. The evaluation is conducted on three real-world, large-scale KGs: the public QALD4 biomedical task involving three datasets for drugs, diseases, and side effects, a bioinformatics KG involving two datasets for orthology and gene expression, and a KG for EU projects. The questions used for testing are either from the QALD challenge or devised by domain experts. As shown in Table 4 in the paper, Bio-SODA achieved a score of 0.6 in precision, recall and F1-measure on all three KGs. This can be seen as an impressive result for a fully automatic system that uses almost no external sources.

What bothers me the most is the contradiction in the rationale of the work, namely developing a domain-independent, generic approach for targeting domain KGs. Let me elaborate on this problem as follows.
- Independent of any extra domain knowledge resources, Bio-SODA is meant to be applicable to all kinds of KGs. This should then include open-domain KGs such as DBpedia as well. I therefore suggest adding such experiments. Of course, if this is the case, the whole tone of the paper would change to something like “A generic QA system for KGs without training QA pairs”.
- If the authors stick with closed-domain KGs, then it is important to make clear the vital differences between open-domain and closed-domain KGs so that respective systems can be developed. Section 2 of the paper is dedicated to listing and analyzing the challenges of QA over domain KGs. However, it seems to me that most of them exist for open-domain KGs as well, including “Rule-based approaches perform well, but are costly to build and maintain”, “Schema-less, incomplete data with imprecise labels”, “Integration with external datasets” and so on, all of which can be spotted in KGs like DBpedia. As a consequence, the design goals presented in Section 6.1 should be expected for QA tools on open KGs as well, rather than being unique to domain KGs. Accordingly, listing these challenges cannot be counted as a contribution of the paper at the end of the “Introduction” section.
- The “Specialized domain vocabulary and terminology” listed in Section 2 is indeed the thing that distinguishes domain KGs from open KGs. It has also led to far more domain terminologies and ontologies being developed than for commonsense knowledge, as the bioinformatics domain shows. It seems to me that a QA system aimed specifically at domain KGs should take advantage of these resources. Take bioinformatics, the main test domain in the paper, for instance. The UMLS is so far the most comprehensive biomedical terminology system, integrating more than a hundred terminologies and ontologies. I do not see why the UMLS (or part of it), including its rich synonyms and its lexical NLP tools, is not used to facilitate the translation from natural language questions into SPARQL queries over biomedical KGs. This could help alleviate the “incomplete information” problem stated in Section 5.5 “Error Analysis”, which is an inherent problem of large-scale KGs. Solely relying on data within the KG makes it hard for Bio-SODA to perform at the top, as a score of 0.6 in precision and recall cannot be satisfactory for any practical QA system. By the way, given Table 4, where Bio-SODA performs at a medium level even in comparison with at most three systems, claims like “Bio-SODA outperforms generic KGQA systems available for testing in a closed-domain setting by increasing the F1-score by at least 20% across all datasets tested” in the Abstract and elsewhere are misleading and should be rephrased to reflect the ordinary performance of the system.

There are some other problems about the paper and I list them as follows.
- One of the contributions declared at the end of the “Introduction” section is a novel ranking algorithm for selecting the best SPARQL candidate, the first to combine measures of string and semantic similarity with node importance. The semantic similarity is computed based on a pre-trained word embedding from the biomedical research article repository PubMed, and this seems to be the only external resource Bio-SODA uses. It is good to combine various measures and use external domain resources, and in this regard I suggest adding some structural similarity to measure the commonality of nodes in terms of their links to other nodes (see the sketch after this list). Moreover, the node importance measure implies that nodes linked with more nodes, i.e. hub-like nodes, are more likely to be selected; for example, “lung” representing an anatomical entity has a higher centrality than its other senses. Using such node importance may cause the less common senses of terms to always rank low and thus never be selected, as shown by an error in Section 5.5 where “lung” means a property of a gene rather than the organ in a question. Such a downside of the node centrality measure should be dealt with, or at least discussed, in the paper.
- Bio-SODA can handle KGs consisting of multiple datasets that are connected by owl:sameAs statements. This is an advantage of Bio-SODA; what I do not agree with, however, is that such cases are called “redundancy” throughout the paper and treated as a problem. It is common that different datasets covering the same domain describe the same entities from their own perspectives, and I think this should be preprocessed by some alignment/merging technique so as to explicitly identify the overlap as well as the differences, which can be done before QA gets started. In other words, the so-called “redundancy” should be dealt with in a systematic way in Bio-SODA.
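To make the score combination discussed above concrete, the following is a minimal sketch (not Bio-SODA's actual implementation; the toy graph, weights, and stubbed embedding similarity are hypothetical) of how string similarity, embedding similarity, PageRank-based node importance, and the suggested structural similarity could be blended into a single candidate score. It also reproduces the “lung” failure mode: the frequent sense dominates because of its higher centrality.

```python
# Hypothetical sketch of candidate scoring: string similarity + (stubbed)
# embedding similarity + PageRank node importance + a structural similarity
# of the kind suggested above. Not Bio-SODA's actual algorithm.
from difflib import SequenceMatcher
import networkx as nx

# Toy summary graph over classes/properties of an imaginary biomedical KG.
G = nx.DiGraph([
    ("Drug", "Disease"), ("Drug", "SideEffect"),
    ("Disease", "Gene"), ("Gene", "AnatomicalEntity"),
    ("Gene", "lung_property"),           # rare sense: a property of a gene
    ("AnatomicalEntity", "lung_organ"),  # frequent sense: the organ
    ("Drug", "lung_organ"), ("Disease", "lung_organ"),
])
pagerank = nx.pagerank(G)

def string_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def embedding_sim(a, b):
    # Placeholder for cosine similarity of pre-trained (e.g. PubMed) word
    # embeddings; stubbed to a constant to keep the sketch self-contained.
    return 0.5

def structural_sim(node_a, node_b):
    # Suggested structural similarity: Jaccard overlap of neighbourhoods.
    na = set(G.successors(node_a)) | set(G.predecessors(node_a))
    nb = set(G.successors(node_b)) | set(G.predecessors(node_b))
    return len(na & nb) / len(na | nb) if na | nb else 0.0

def candidate_score(keyword, node, context_node, w=(0.4, 0.3, 0.2, 0.1)):
    # Hypothetical weights; a real system would tune or learn them.
    return (w[0] * string_sim(keyword, node)
            + w[1] * embedding_sim(keyword, node)
            + w[2] * pagerank[node]
            + w[3] * structural_sim(node, context_node))

# The frequent sense ("lung_organ") outranks the rare one ("lung_property"),
# which is exactly the downside discussed above.
for cand in ("lung_organ", "lung_property"):
    print(cand, round(candidate_score("lung", cand, "Disease"), 3))
```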

Some small points are as follows.
- The users of Bio-SODA are stated several times in the paper to be domain experts. I don't see why ordinary people cannot query a domain KG, or what difference this would make to the design of the QA system.
- For every algorithm, such as Algorithm 1, the complexity as well as termination should be analyzed.
- In the Abstract, the authors declare that “We also provide a new bioinformatics benchmark with complex queries drafted in collaboration with domain experts.” The fact is that two sets of 30 queries are devised with domain experts and tested over the bioinformatics KG and the EU projects KG, respectively. There is no evidence whatsoever in the paper to show that these queries qualify as a new benchmark.

In terms of the writing, the paper is clearly written and easy to follow. The content organization is generally good, except for the following two points.
- Section 3, titled “Question Answering Pipeline”, actually gives an example to illustrate the steps in the processing of Bio-SODA. No pipeline is explicitly presented, and it seems that the part before Figure 4 in Section 4 “System Architecture” is more about the content of the pipeline. Thus I suggest renaming Section 3 to “An Illustrative Example”.
- Section 6, titled “Lessons Learned”, consists of only one subsection, 6.1, which is about “Design Goals for … ”, and there is no discussion of lessons.

Overall, the presented generic, domain-independent system does not show significant results over domain KGs, and the originality of the approach is not very strong. I suggest Major Revision: either pursue effectiveness on all kinds of KGs, including open ones, or add domain knowledge resources to enhance the system towards its current goal.

Review #2
Anonymous submitted on 30/Dec/2020
Suggestion:
Major Revision
Review Comment:

The authors of this manuscript discuss the design and evaluation of a question-answering system for bioinformatics knowledge graphs. Compared to other systems (open or closed domain), it leverages the PageRank network metric to rank the triple patterns, which are then used to formulate the SPARQL query. Other than that, the remaining aspects are no different from what has already been done in other QA systems for ontologies and KGs. They tested their system on 3 KG datasets - QALD4, Bioinformatics (Bgee and OMIA), and Cordis. I can't speak on the choice of datasets since I am not familiar with them. However, QALD4 comes with its own set of test questions, and the researchers produced their own for the other two. The results of their experimental tests showed "ok" performance on the QALD4 dataset (0.60 F1 score compared to 0.99 for the highest), but fared better in F1 score on the Bioinformatics and Cordis datasets: on average, a 0.62 F1 score across the tested datasets. While I appreciate the work of this study and share the authors' sentiment that their approach is explainable and transparent, there are numerous questionable aspects, ranging from the evaluation strategy to design issues, manuscript-related issues, etc. Details and questions below (note that I use ontologies and KGs interchangeably for convenience):

* One very confusing aspect of this work is whether it is an open-domain or closed-domain system, or perhaps the authors have not defined what would constitute an open- or closed-domain QA system for ontologies. In one passage, "a question answering system for closed domain... as well as updates in the existing data sources" sounds antithetical to what a closed-domain KG QA system is supposed to be. So what exactly makes Bio-SODA a closed-domain system? Is the tool "hard coded" for a handful of bio-related ontologies? Furthermore, the "Generality" point made in "Lessons Learned" adds to the confusion.

* The related studies, i.e. the examination of other QA systems for biohealth KGs, are lacking. I am very certain, having researched the topic myself, that the authors are missing other QA systems for biomedical ontologies. In light of the existing QA systems for biohealth KGs, why do we need another one? I recommend overhauling the related studies with a systematic examination of those systems.

* Many biohealth KGs utilize the Basic Formal Ontology (BFO), and it is almost a standard for biohealth ontologies to have an alignment to BFO (see the OBO Foundry project). How does this factor into or affect your approach or design?

* How does your system accommodate questions that are worded differently and/or use synonymous terms? For example, take the sample question "What are the drugs for diseases associated with the BRCA genes?"... what if users use the word "correlated", or if the question is reworded but expresses the same inquiry?

* What is the "Summary Graph"? Is it a derivation of the target KB? Further clarification is needed to understand it role.

* I admire the use of PageRank to support the functioning of the system. However, why PageRank? I recall the authors referenced it as a centrality metric for graphs, but essentially it is just a different method of computing an averaged in-degree from node ties. With that said, what about "plain" in-degree computation, or using other in-degree and centrality metrics like Katz, eigenvector, closeness, or betweenness (see the sketch below)? ** I'll admit being picky here, as PageRank seems to be over-hyped and over-used in the CS world **
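To illustrate the question, here is a minimal sketch (the toy summary graph and node names are hypothetical) that ranks the same nodes with PageRank, plain in-degree, Katz, and betweenness centrality via networkx; this is an easy way to check whether the choice of metric actually changes the ranking:

```python
# Hypothetical comparison of centrality metrics on a toy summary graph;
# PageRank is only one of several options for ranking candidate nodes.
import networkx as nx

G = nx.DiGraph([
    ("Drug", "Disease"), ("Drug", "SideEffect"),
    ("Disease", "Gene"), ("Gene", "AnatomicalEntity"),
    ("Drug", "AnatomicalEntity"), ("Disease", "AnatomicalEntity"),
])

metrics = {
    "pagerank":    nx.pagerank(G),
    "in_degree":   nx.in_degree_centrality(G),
    "katz":        nx.katz_centrality(G, alpha=0.1),
    "betweenness": nx.betweenness_centrality(G),
}

for name, scores in metrics.items():
    # Sort nodes from most to least central under each metric.
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(f"{name:12s}", ranking)
```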

* Regarding PageRank, it was mentioned that "As an added benefit, scoring with PageRank also ensures that metadata matches are prioritized. For example, Drug as a class will rank higher than an instance match". Interesting notion, but why prioritize the class level over the instance level?

* The subsection "Query Graph Constructor Module" is very unclear. I would advise reworking it to make it clearer for the reader. There is also a lack of clarification about the custom rules and handcrafted procedures. A brief explanation of the Steiner tree problem is needed (see the sketch below).
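For reference, the Steiner tree problem asks for a minimum-weight tree that connects a given set of terminal nodes in a graph, possibly passing through additional intermediate ("Steiner") nodes; since the problem is NP-hard, systems typically rely on approximations. A minimal sketch on a hypothetical toy graph, using networkx's approximation routine rather than the paper's custom procedure:

```python
# Hypothetical toy example of the Steiner tree problem: connect the nodes
# matched from the question ("terminals") with minimum total edge weight,
# possibly via intermediate (Steiner) nodes.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("Drug", "Disease", 1.0),
    ("Disease", "Gene", 1.0),
    ("Drug", "SideEffect", 1.0),
    ("Gene", "AnatomicalEntity", 2.0),
    ("Drug", "AnatomicalEntity", 5.0),
])

# Nodes matched against the question, e.g. "drugs ... genes ... anatomy".
terminals = ["Drug", "Gene", "AnatomicalEntity"]

T = steiner_tree(G, terminals, weight="weight")
print(sorted(T.edges(data="weight")))
# The tree connects the terminals via the intermediate node "Disease"
# instead of using the expensive direct Drug-AnatomicalEntity edge.
```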

* Figure 4 needs some work. Does each of the modules execute sequentially or concurrently? I would advise looking at UML notations to help improve the figure. Also, the Query Execution Module is missing.

* The authors mention the number of triple patterns, implying some importance to the study. How much value or influence does the number of triple patterns have on this work? If none, I don't think it is worth mentioning.

* While QALD4 comes with available questions for testing, the other datasets had individuals develop the test questions. Who were these individuals and what process was involved in producing the questions? If they were co-authors, it should be mentioned. This also brings me to question the rigor of the evaluation strategy. Since the test questions were developed for Bioinformatics and Cordis, how did you verify the results? What procedure was involved in determining the correct answers, who determined them, and so on?

* "Precision@1" is this short hand for something?

* While the authors discuss the impact of the ranking algorithm in "Impact of Ranking Algorithm", it is not mentioned in the "Lessons Learned" section.

* In the "Conclusion and Outlook", I would slightly downplay the results as it is mediocre at best (See comments above). However, there are some summarized takeaways you can mention. Also the future direction content seemed rushed and disorganized.