Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
---------------------------------------
The paper studies the task of answering natural language questions (QA) over single-domain knowledge graphs (KGs), in contrast to the majority of QA research, which targets open-domain KGs such as DBpedia or Wikidata. The presented system, Bio-SODA, (1) requires no question-answer pairs for training and relies solely on data from the KG in question, making it suitable for domains where training data or knowledge sources are absent or scarce; (2) works for KGs integrating multiple datasets and handles multi-hop questions with many triple patterns per query; and (3) adapts incrementally when the KG is updated or new datasets are added. The evaluation is conducted on three real-world, large-scale KGs: the public QALD4 biomedical task involving three datasets for drugs, diseases, and side effects; a bioinformatics KG involving two datasets for orthology and gene expression; and a KG for EU projects. The test questions are either taken from the QALD challenge or devised by domain experts. As shown in Table 4 of the paper, Bio-SODA achieved a score of 0.6 in precision, recall, and F1-measure on all three KGs. This can be seen as an impressive result for a fully automatic system that uses almost no external sources.
What bothers me most is the contradiction in the rationale of the work, namely developing a domain-independent, generic approach to target domain-specific KGs. Let me elaborate on this problem as follows.
- Since it is independent of any extra domain knowledge resources, Bio-SODA is meant to be applicable to all kinds of KGs, which should include open-domain KGs such as DBpedia. I therefore suggest adding such experiments. Of course, in that case, the whole tone of the paper would change to something like “A generic QA system for KGs without training QA pairs”.
- If the authors stick with closed-domain KGs, then it is important to make clear the vital differences between open-domain and closed-domain KGs so that systems tailored to each can be developed. Section 2 of the paper is dedicated to listing and analyzing the challenges of QA over domain KGs. However, it seems to me that most of them exist for open-domain KGs as well, including “Rule-based approaches perform well, but are costly to build and maintain”, “Schema-less, incomplete data with imprecise labels”, “Integration with external datasets”, and so on, all of which can be spotted in KGs such as DBpedia. Consequently, the design goals presented in Section 6.1 should be expected of QA tools for open KGs as well, rather than being unique to domain KGs. Accordingly, listing these challenges cannot be counted as a contribution of the paper at the end of the Introduction.
- The “Specialized domain vocabulary and terminology” listed in Section 2 is indeed what distinguishes domain KGs from open KGs. It also explains why far more domain terminologies and ontologies have been developed than for commonsense knowledge, as the bioinformatics domain shows. It seems to me that a QA system aimed specifically at domain KGs should take advantage of these resources. Take bioinformatics, the main test domain of the paper, for instance. The UMLS is by far the most comprehensive biomedical terminology system, integrating more than a hundred terminologies and ontologies. I do not see why the authors do not use the UMLS (or part of it), including its rich synonyms and the NLP tools of the UMLS Lexicon, to facilitate the translation of natural language questions into SPARQL queries over biomedical KGs; a minimal sketch of what I mean is given below. This could help alleviate the “incomplete information” problem stated in Section 5.5 “Error Analysis”, which is an inherent problem of large-scale KGs. Relying solely on data within the KG makes it hard for Bio-SODA to perform at the top, and a score of 0.6 in precision and recall cannot be satisfactory for any practical QA system. Incidentally, given that Table 4 shows Bio-SODA performing at a medium level even in comparison with at most three systems, claims such as “Bio-SODA outperforms generic KGQA systems available for testing in a closed-domain setting by increasing the F1-score by at least 20% across all datasets tested” in the Abstract and elsewhere are misleading and should be rephrased to reflect the ordinary performance of the system.
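To make this suggestion concrete: the following is a minimal sketch, assuming a local UMLS installation with the standard MRCONSO.RRF concept file, of how UMLS synonyms could be used to expand question terms before matching them against KG labels. The file path, the sample tokens, and the function names are illustrative on my part, not anything from the paper.

```python
from collections import defaultdict

def load_umls_synonyms(mrconso_path, lang="ENG"):
    """Build CUI -> synonyms and term -> CUIs maps from UMLS MRCONSO.RRF.

    MRCONSO.RRF is pipe-delimited; field 0 is the concept ID (CUI),
    field 1 the language (LAT), and field 14 the term string (STR).
    """
    cui_to_terms = defaultdict(set)
    term_to_cuis = defaultdict(set)
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            cui, lat, term = fields[0], fields[1], fields[14]
            if lat != lang:
                continue
            cui_to_terms[cui].add(term.lower())
            term_to_cuis[term.lower()].add(cui)
    return cui_to_terms, term_to_cuis

def expand_question_terms(tokens, cui_to_terms, term_to_cuis):
    """Expand each question token with UMLS synonyms sharing a concept (CUI)."""
    expansions = {}
    for tok in tokens:
        syns = set()
        for cui in term_to_cuis.get(tok.lower(), ()):
            syns |= cui_to_terms[cui]
        expansions[tok] = syns - {tok.lower()}
    return expansions

# Hypothetical usage on a QALD4-style question:
# cui_to_terms, term_to_cuis = load_umls_synonyms("META/MRCONSO.RRF")
# expand_question_terms(["drugs", "side", "effects"], cui_to_terms, term_to_cuis)
```

Even a partial synonym table of this kind would give the candidate-matching step a vocabulary far richer than the labels present in the KG itself.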
There are some other problems with the paper, which I list as follows.
- One of the contributions declared at the end of the Introduction is a novel ranking algorithm for selecting the best SPARQL candidate, claimed to be the first to combine measures of string similarity, semantic similarity, and node importance. The semantic similarity is computed with word embeddings pre-trained on the biomedical research article repository PubMed, which seems to be the only external resource Bio-SODA uses. Combining various measures and using external domain resources is good, and in this regard I suggest also adding a structural similarity measure that captures the commonality of nodes in their links to other nodes. Moreover, the node importance measure means that nodes linked to many others, i.e., hubs, are more likely to be selected; for example, “lung” as an anatomical entity has a higher centrality than its other senses. Using such node importance may cause the less common senses of a term to always rank low and thus never be selected, as shown by an error in Section 5.5 where “lung” denotes a property of a gene rather than the organ. Such a downside of the node centrality measure should be addressed, or at least discussed, in the paper; see the first sketch after this list.
- Bio-SODA can handle KGs consisting of multiple datasets connected by owl:sameAs statements. This is an advantage of Bio-SODA; what I do not agree with is that such cases are called “redundancy” throughout the paper and treated as a problem. It is common for different datasets covering the same domain to describe the same entities from their own perspectives, and I think this should be preprocessed by some alignment/merging technique so as to explicitly identify the overlap as well as the differences, which can be done before QA starts. In other words, the so-called “redundancy” should be dealt with in a systematic way in Bio-SODA; see the second sketch after this list.
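First sketch: to make the ranking suggestion concrete, the following combines the three measures into a single candidate score. The weights, the embedding table (standing in for the PubMed-trained vectors), and the centrality scores are hypothetical placeholders of mine, not the paper's actual formula.

```python
import math
from difflib import SequenceMatcher

def string_sim(a, b):
    """Normalized character-level similarity; a stand-in for the paper's string metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def rank_candidates(term, candidates, embeddings, centrality, weights=(0.4, 0.4, 0.2)):
    """Score candidate KG nodes for one question term.

    candidates: list of (uri, label) pairs matched against the term
    embeddings: label -> vector map (standing in for PubMed-trained vectors)
    centrality: uri -> normalized importance score (e.g. PageRank)
    weights:    illustrative mix of string sim, semantic sim, node importance
    """
    w_str, w_sem, w_imp = weights
    q_vec = embeddings.get(term.lower())
    scored = []
    for uri, label in candidates:
        score = w_str * string_sim(term, label)
        c_vec = embeddings.get(label.lower())
        if q_vec is not None and c_vec is not None:
            score += w_sem * cosine(q_vec, c_vec)
        score += w_imp * centrality.get(uri, 0.0)
        scored.append((score, uri, label))
    return sorted(scored, reverse=True)

# Hypothetical usage: the organ sense of "lung" wins on centrality even when
# the question refers to the gene property, illustrating the bias noted above.
# rank_candidates("lung", [(":Lung_organ", "lung"), (":lung_gene_prop", "lung")],
#                 embeddings={}, centrality={":Lung_organ": 0.9, ":lung_gene_prop": 0.1})
```

A fixed weight on centrality is exactly what buries rare senses; making the importance weight sensitive to the question context would be one way to address the issue.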
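Second sketch: as an illustration of the alignment preprocessing I have in mind, the following clusters entities connected by owl:sameAs using the rdflib library, so that overlap across datasets is made explicit before QA starts. The input file name is illustrative.

```python
from rdflib import Graph
from rdflib.namespace import OWL

def sameas_clusters(graph):
    """Union-find over owl:sameAs links, yielding one cluster per real-world entity.

    Merging these clusters up front would turn the paper's so-called
    'redundancy' into an explicit alignment step run before question answering.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for s, _, o in graph.triples((None, OWL.sameAs, None)):
        union(s, o)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return clusters

# Hypothetical usage on an integrated KG dump:
# g = Graph(); g.parse("integrated_kg.ttl", format="turtle")
# for canonical, members in sameas_clusters(g).items():
#     ...  # merge labels/properties of `members` under `canonical`
```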
Some small points are as follows.
- The users of Bio-SODA are declared several times in the paper to be domain experts. I do not see why ordinary people could not query a domain KG, or what difference this would make to the design of the QA system.
- For every algorithm, such as Algorithm 1, both the complexity and the termination should be analyzed.
- In the Abstract, the authors declare that “We also provide a new bioinformatics benchmark with complex queries drafted in collaboration with domain experts.” In fact, two sets of 30 queries were devised with domain experts and tested over the bioinformatics KG and the EU projects KG, respectively. There is no evidence whatsoever in the paper that these queries qualify as a new benchmark.
In terms of writing, the paper is clear and easy to follow. The content organization is generally good, except for the following two points.
- Section 3, titled “Question Answering Pipeline”, actually gives an example to illustrate the steps of the Bio-SODA process. No pipeline is explicitly presented, and it seems that the part before Figure 4 in Section 4 “System Architecture” is closer to the content of a pipeline. I therefore suggest renaming Section 3 to “An Illustrative Example”.
- Section 6, titled “Lessons Learned”, consists of only one subsection, 6.1, which is about “Design Goals for …”, and there is no discussion of lessons.
Overall, the presented generic, domain-independent system does not show significant results over domain KGs, and the originality of the approach is not very strong. I suggest Major Revision: either pursue effectiveness on all kinds of KGs, including open ones, or add domain knowledge resources to strengthen the system toward its current goal.