MuHeQA: Zero-shot Question Answering over Multiple and Heterogeneous Knowledge Bases

Tracking #: 3302-4516

Authors: 
Carlos Badenes-Olmedo
Oscar Corcho

Responsible editor: 
Guest Editors Interactive SW 2022

Submission type: 
Full Paper
Abstract: 
There are two main limitations in most of the existing Knowledge Graph Question Answering (KGQA) algorithms. First, the approaches depend heavily on the KG structure and cannot be easily adapted to other KGs. Second, the availability and amount of additional domain-specific data in structured or unstructured formats has also proven to be critical in many of these systems. Such dependencies limit the applicability of KGQA systems and make their adoption difficult. We propose a novel algorithm, MuHeQA, that alleviates both limitations by retrieving the answer from textual content automatically generated from KGs instead of issuing queries over them. This new approach (1) works on one or several KGs simultaneously, (2) does not require training data, which makes it domain-independent, (3) enables the combination of knowledge graphs with unstructured information sources to build the answer, and (4) reduces the dependency on the underlying schema, since it does not navigate through structured content but only reads property values. MuHeQA extracts answers from textual summaries created by combining information related to the question from multiple knowledge bases, whether structured or not. Experiments over Wikidata and DBpedia show that our approach achieves performance comparable to other approaches on single-fact questions while being domain- and KG-independent. The results raise important questions for future work about how the textual content that can be created from knowledge graphs enables answer extraction.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 04/Jan/2023
Suggestion:
Minor Revision
Review Comment:

The proposed method provides one or more natural language answers to questions posed in plain language. A distinct strength of the algorithm is providing evidence along with the answer. The evidence takes the form of a sentence from which the answer was extracted associated with a confidence value.

The manuscript is overall solidly written with only a few typos (especially related to punctuation), e.g., on page 10, lines 40/41; page 11, line 25; ...
However, more details could be provided for the approach taken. The description of summarization on lines 4-5 on page 5 is too vague. Also, I could not easily find which POS tagger/sentence splitter the authors used and how reliable these components are. Looking into the code, it appears to me that what is referred to as a "sentence" in the article may not always be a sentence in the grammatical sense, as it seems to be extracted using the function "property_to_text", which returns the entire object. I have similar comments towards other steps in the pipeline. The authors should strongly consider adding at least one pseudocode listing for each pipeline step.

For the keyword identification task, the authors should clarify the setup of the baseline approaches. In particular, the reference in Table 1 is for general "BERT" (Devlin et al) [16], but earlier, the authors also reference [19]. The comparatively poor performance of BERT-NER on SimpleQuestions should be commented on/analyzed.

The authors should state the size of the evaluation datasets (Tables 1-5).

The authors could in more detail describe the role of the websearchentities/Lookup service linking. How many candidates are retrieved? Where are these approaches reflected in the evaluation tables?

Reference linking systems (spaCy and DBpedia Spotlight) are used. While these are commonly used baselines, the authors could better justify the choice of these systems with respect to other recent evaluations involving state-of-the-art approaches.

The authors should comment on the reproducibility of the results in the article with code in the repository. The code seems to run (checked for eval_dataset.py), but it is unclear which scripts relate to which tables and whether this also includes baseline methods such as BERT-NER.

The authors could comment on whether the evidence provided for each claim should not also contain information on the source KG from which the sentence was retrieved. How does the user interpret the confidence value? Is it a probability? Could you, e.g., provide a figure showing the relation between confidence values and the likelihood of the answer being correct?
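To illustrate the kind of analysis I have in mind: per-bin accuracy over confidence buckets (a reliability table) would show whether higher confidence actually corresponds to a higher chance of a correct answer. The sketch below is hypothetical reviewer-side code, not the authors' implementation, and assumes confidence values lie in [0, 1].

```python
from collections import defaultdict

def calibration_table(predictions, n_bins=5):
    """Bucket (confidence, is_correct) pairs and report per-bin accuracy.

    `predictions` is a list of (confidence, is_correct) tuples; the
    confidence is assumed to be a score in [0, 1].
    """
    bins = defaultdict(list)
    for conf, correct in predictions:
        # Map confidence to a bin index; clamp conf == 1.0 into the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(correct)
    table = {}
    for idx, outcomes in sorted(bins.items()):
        lo, hi = idx / n_bins, (idx + 1) / n_bins
        table[(lo, hi)] = sum(outcomes) / len(outcomes)
    return table
```

If the system is well calibrated, per-bin accuracy should rise roughly in line with the bin's confidence range; a flat or erratic table would suggest the value should not be read as a probability.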

Review #2
Anonymous submitted on 06/Jan/2023
Suggestion:
Major Revision
Review Comment:

This paper proposes a novel knowledge graph question answering (KGQA) approach, MuHeQA, that tries to overcome two limitations of the existing KGQA systems:
- MuHeQA is KG structure-independent, so it can be applied to various KGs, unlike other systems. MuHeQA achieves this by generating answers from textual content using KGs rather than translating the question into a formal query language.
- MuHeQA does not require domain-specific data to train supervised models. MuHeQA performs textual searches based on the terms identified in the query using an inverse index of the labels associated with the resources to avoid the need to create vector spaces where each resource is represented by its labels.
The novel approach was evaluated on one-hop question benchmarks over Wikidata and DBpedia.
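If I understand the label-index idea correctly, it amounts to something like the following minimal sketch (identifiers and tokenisation are my own invention, not the authors' code): labels are tokenised into an inverted index, and question keywords are looked up directly instead of being embedded into a vector space.

```python
from collections import defaultdict

def build_label_index(resources):
    """Build an inverted index from label tokens to resource IDs.

    `resources` maps a resource ID to its label, e.g. {"Q42": "Douglas Adams"}.
    Tokenisation here is a plain lowercase split; a real system would
    normalise further (stemming, diacritics, etc.).
    """
    index = defaultdict(set)
    for rid, label in resources.items():
        for token in label.lower().split():
            index[token].add(rid)
    return index

def lookup(index, keywords):
    """Return the resource IDs matching any of the question keywords."""
    hits = set()
    for kw in keywords:
        hits |= index.get(kw.lower(), set())
    return hits
```

This makes the claimed trade-off concrete: no training or vector space is needed, at the cost of relying on surface overlap between question terms and resource labels.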

I would like to say that KGQA is not my domain of expertise. However, the paper is easy to follow, and the topic is interesting. From the beginning, the authors identify the limitations of existing KGQA systems. Still, I think some clarification is needed about how the novel approach overcomes these limitations and how the evaluation demonstrates this.
I leave more detailed considerations about the paper below:

- In the Introduction, it is said that "MuHeQA supports single fact questions (i.e. one subject entity, one relation and one object entity), also known as single-hop questions, and can also handle multiple-hop questions (i.e. more than one relationship) by an iterative process after breaking down the question into single-hop questions". However, the evaluation was only done using one-hop datasets, and throughout the paper some sentences contradict this initial statement (e.g., "As a restriction, our approach only works with single-hop questions instead of multiple-hop questions" and "the next steps are to support multi-hop queries to accept complex questions").
I think the authors should be more clear about this topic.

- I think it would be beneficial if the authors in the related work section presented the state-of-the-art approaches with which MuHeQA is compared in the Results section (namely Falcon 2.0, SYGMA, StaG-QA, and RAG-end2end). Since I am not familiar with these approaches, it is difficult to understand what their specific limitations are and why the proposed system is better (especially considering that sometimes MuHeQA loses in performance).

- For resource identification, the authors used a sequence-to-sequence language model to generate vector representations of the resource and the query. However, no details about the chosen model are described.

- Figure 3 shows the SPARQL queries that retrieve all the related information. From what I understand, only triples that include the resources are captured, which can be a limitation. Could this explain why the answers don't have the expected richness?

- Regarding the evaluation, I was left with doubts about the datasets. I would expect that, for some datasets, more than one KG would be used, since the novelty of MuHeQA is its ability to handle multiple KGs.

- Table 1 presents the results of keyword identification using various language models, which raised some doubts.
What is the justification for a system like BERT performing so poorly? Why is there such a considerable difference between using BERT (4.46) and RoBERTa (66.65)? I think this should be discussed in more depth.

- My last comment is about the EM score. Is this a good metric for KGQA? If so, why is it only included when unstructured data sources are used?
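For context on my EM question: EM is usually defined as string-level exact match after normalisation, as popularised by reading-comprehension benchmarks. A typical (SQuAD-style) implementation looks roughly like this; it is an illustration of the standard metric, not necessarily what the authors computed.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace
    (the normalisation commonly used for SQuAD-style exact match)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalised prediction equals any normalised gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))
```

Because this rewards only literal string equality, it is natural for extractive answers from text, which may be why it appears only for the unstructured sources; for KG answers, entity-level matching would seem more appropriate. The authors should clarify this choice.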

Typos:
- In line 28 of page 7, the full stop needs to be included.
- In line 43 of page 13, the full stop needs to be included.
- In line 5 of page 35, some acronyms are used, but the authors never introduced them (e.g., WDT, NN, VBD, and RF).

Review #3
By Floriano Scioscia submitted on 13/Jan/2023
Suggestion:
Minor Revision
Review Comment:

The work proposes a Knowledge Graph Question Answering (KGQA) system whose peculiarity is avoiding the translation of natural language questions into formal KG query languages such as SPARQL: the answer is extracted and verbalized from the KG properties. This has two key benefits with respect to the majority of state-of-the-art KGQA frameworks:
1) the system does not need training on the specific structure of the particular KG(s);
2) multiple KGs as well as unstructured texts can be combined.
Evidence (in the form of phrases supporting the answer) and a confidence value are emitted together with each answer.

A system prototype has been implemented, and code and data are provided as an attached resource. Experimental evaluation has been carried out for all the main aspects of the proposed approach: results show the proposed system is very close to state-of-the-art systems performance-wise, while having easier setup and greater flexibility.

The work appears scientifically sound and fairly novel. Individual components often leverage well-known techniques, but the overall system architecture and behavior are distinctive enough with respect to existing proposals.
Results are significant because they demonstrate the feasibility and effectiveness of a different type of KGQA approach from the majority of existing proposals, and at the same time they open up some questions for further investigation.
The manuscript is well structured and the proposed methods are described in adequate technical detail.

Some aspects of the manuscript can be improved, as explained in what follows.

The README in the attached data file explains that the name MuHeQA stands for Multiple and Heterogeneous Question Answering. This explanation is missing from the manuscript: it would be appropriate to include it upon first mention of MuHeQA in the introduction (page 2, line 23).

At the end of the introduction, a short summary of the subsequent article sections would be useful.

In the final example of Section 3.1.1, some part-of-speech (PoS) tags are used without definition, such as VBN and RP. For a clearer and more complete description of the algorithm, it could be useful to list and explain all PoS tags used by the adopted algorithm.
Most importantly, it should be clarified whether the PoS tagging algorithm is new or taken from the literature/software libraries, and in the latter case the source should be referenced. (Looking at the 'requirements.txt' file in the attached zip data file, I would presume the NLTK Python library was used for PoS tagging, but this should be stated in the manuscript; it could also help readers understand the meaning of the PoS tags.)
Finally, in Figure 2 it does not appear correct that a branch can skip the 'Group 2' items: in that case, a keyword could be obtained with just one or two JJ/CC items, which does not match what is described in Section 3.1.1 (and there would also be the trivial match of the regular expression represented by the state machine with the empty string).
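To make this point concrete, the state machine can be approximated as a regular expression over a space-separated PoS tag sequence (the tag groups below are my guesses from Figure 2, not the paper's exact definition): if the Group 2 part is optional, the pattern accepts the empty string and noun-free "keywords".

```python
import re

# Hypothetical encoding of the Figure 2 state machine as a regex over
# space-separated Penn Treebank PoS tags; the groups are illustrative only.
GROUP1 = r"(?:JJ|CC)\s+"        # adjectives / conjunctions
GROUP2 = r"(?:NN|NNS|NNP)\s+"   # nouns (the content-bearing part)

skippable = re.compile(rf"^(?:{GROUP1})*(?:{GROUP2})*$")  # Group 2 optional
required = re.compile(rf"^(?:{GROUP1})*(?:{GROUP2})+$")   # Group 2 mandatory

# With Group 2 optional, the empty tag sequence and a lone "JJ" both match,
# yielding keywords with no noun at all:
assert skippable.match("")
assert skippable.match("JJ ")
# Requiring at least one Group 2 tag rules those out:
assert not required.match("")
assert not required.match("JJ ")
assert required.match("JJ NN ")
```

This is why I believe the 'Group 2' states should be mandatory in the diagram, or the text in Section 3.1.1 adjusted accordingly.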

Why is a 'LIMIT 250' clause present in the SPARQL queries shown in Figure 3? Does the provided implementation extract only 250 properties from Wikidata and DBpedia?

In the semantic similarity example in Section 3.1.2, it is not clear how the outcomes of the property-level and description-level similarity are obtained; more details should be provided.

In Section 3.3, the question 'How many active ingredients does paracetamol have?' does not make sense, as paracetamol itself is an active ingredient, and the reported answer lists four commercial names for paracetamol, not active ingredients. It is not clear whether the example has been formulated this way by mistake or purposefully (but in that case the purpose should be explained).

In Section 4.2, the adopted metric is not completely clear. In particular, do "for each answer" and "of all answers" (page 10, lines 23-24) actually mean "for each question" and "of all questions", respectively?

Eight references out of 41 are taken from non-peer-reviewed sources like arXiv. Whenever possible, check whether these works have been eventually published in a peer-reviewed venue, and decide on the opportunity of keeping the reference or not. For example, reference [4] was published as a poster paper in ICLR 2021.

Language and style are generally good. Minor issues:
- Missing punctuation or extra spaces in some places.
- In journal articles the usage of the first person ("we") and contracted forms ("let's") is usually discouraged.
- Page 1, line 39: "Knowledge graph Question Answering (KGQA)" -> "Knowledge Graph Question Answering (KGQA)"
- Page 1, line 42: it is better to move footnote 2 and "for property graphs" from page 2, line 1 to there, where Cypher is first mentioned.
- Page 2, line 32: "i.e" -> "i.e."
- Page 7, line 24: "we develop a basic solution" -> "a basic solution has been developed"
- Page 8, line 11: "i.e" -> "i.e."
- Page 9, line 1: "fined-tuned" -> "fine-tuned"
- Page 9, line 29: "aka." -> "a.k.a."
- Page 9, line 22: "question-answer interface" -> "question-answering interface"
- Page 9, line 31: "what means that the structure of the answers are dictated" -> "which means that the structure of the answers is dictated"
- Page 10, lines 17 and 18: "considers valid" -> "considers as valid"
- Page 10, line 20: "selects the three most relevant" -> "selects the three most relevant answers"
- Page 10, line 30: “of a valid answer(s)” -> “of valid answers”
- Page 10, line 40: “(.e.g.” -> “(e.g.”
- Page 13, line 21: “STaF-QA” -> “STaG-QA”
- Page 13, line 30: “highlights” -> “highlight”

DATA FILE
The data file is published on GitHub. It contains a README file with instructions, the source code, data and queries to reproduce the experiments reported in the manuscript. The contents seem complete and sufficient for reproducibility of results.

Review #4
By Takahira Yamaguchi submitted on 28/Jan/2023
Suggestion:
Major Revision
Review Comment:

This paper presents MuHeQA with the following features: a linking method to find KGs relevant to a given question, a KGQA algorithm to extract the answer by combining multiple KGs and unstructured data sources, and an open-source implementation. MuHeQA consists of three components: Summarization, Evidence Extraction and Answer Composition. MuHeQA has been evaluated on three tasks: identification of keywords, discovery of related resources in the KG, and generation of a valid answer.

Looking at this architecture, my first question is why you do not use domain ontologies together with the knowledge graphs. Because they provide concept hierarchies and property semantics for the KG, we could have QA systems that combine ontologies and KGs.

Secondly, comparing MuHeQA with other KGQA systems, the experiment shows us that MuHeQA provides performance close to the best system, STaG-QA, and better than Falcon 2.0 and SYGMA. However, because the SimpleQuestions dataset is rather old (2015), I am wondering how well MuHeQA works on more recent QA datasets, such as the Stanford Question Answering Dataset (SQuAD). Furthermore, I want to know how well MuHeQA performs compared with other deep learning-based QA systems. In particular, as many users pay much attention to ChatGPT, it would be great if you could explain in which cases MuHeQA is better than ChatGPT.