Multilingual Question Answering over Linked Data building on a model of the lexicon-ontology interface

Tracking #: 3619-4833

Authors: 
Mohammad Fazleh Elahi
Basil Ell
Gennaro Nolano
Philipp Cimiano

Responsible editor: 
Axel Polleres

Submission type: 
Full Paper
Abstract: 
Multilingual Question Answering (QA) has the potential to make knowledge available as linked data accessible across languages. Current state-of-the-art QA systems for linked data are nevertheless typically monolingual, in most cases supporting only English. As most state-of-the-art systems are based on machine learning techniques, porting such systems to new languages requires a training set for every language to be supported. Furthermore, most recent QA systems based on machine learning methods lack controllability and extensibility, making the governance and incremental improvement of these systems challenging, not to mention the initial effort of collecting and providing training data. Towards QA systems that can be ported across languages in a principled manner without the need for training data, and that can be incrementally adapted and improved after deployment, we follow a model-based approach to QA that supports the extension of the lexical and multilingual coverage of a system in a declarative manner. The approach builds on a declarative model of the lexicon-ontology interface, OntoLex lemon, which enables the specification of the meaning of lexical entries with respect to the vocabulary of a particular dataset. From such a lexicon, our approach automatically generates a QA grammar that can be used to parse questions into SPARQL queries. We show that this approach outperforms current QA approaches on the QALD benchmarks. Furthermore, we demonstrate the extensibility of the approach to different languages by adapting it to German, Italian, and Spanish. We evaluate the approach on five editions of the QALD benchmarks (i.e., QALD-9, QALD-7, QALD-6, QALD-5, and QALD-3) and show that it outperforms the state-of-the-art on all these datasets in an incremental evaluation mode in which additional lexical entries covering the test data are added. For example, on QALD-9, our approach obtains F1 scores of 0.85 (English), 0.82 (German), 0.65 (Italian), and 0.83 (Spanish) in this mode. To date, no system described in the literature works for at least four languages while reaching state-of-the-art performance on all of them. Finally, we demonstrate the low effort necessary to port the system to a new dataset and vocabulary.
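
As a concrete illustration of the lexicon-ontology interface the abstract refers to, the sketch below shows how a single OntoLex-Lemon lexical entry might bind the English noun "mayor" to the DBpedia property dbo:mayor, from which grammar rules for questions such as "Who is the mayor of X?" could then be generated. This is an assumed example written with rdflib in Python, not an entry from the paper's lexicon.

```python
# Minimal sketch (assumed example, not from the paper): an OntoLex-Lemon
# lexical entry binding the English noun "mayor" to the DBpedia property
# dbo:mayor. The sense link is what anchors the word's meaning in the
# vocabulary of the target dataset.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
LEXINFO = Namespace("http://www.lexinfo.net/ontology/2.0/lexinfo#")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/lexicon#")  # hypothetical lexicon namespace

g = Graph()
entry, form, sense = EX.mayor_n, EX.mayor_n_form, EX.mayor_n_sense

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, LEXINFO.partOfSpeech, LEXINFO.noun))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, ONTOLEX.writtenRep, Literal("mayor", lang="en")))

# The sense realises the lexicon-ontology interface: it specifies the
# meaning of the entry with respect to the dataset vocabulary.
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, ONTOLEX.reference, DBO.mayor))

print(g.serialize(format="turtle"))
```
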
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/May/2024
Suggestion:
Accept
Review Comment:

# Originality
The paper presents an advanced approach to multilingual QA using the OntoLex-Lemon model, addressing significant limitations in existing systems.
Its key components are: i) the lexicon-ontology interface, ii) grammar rule templates, and iii) multilingual support.
# Significance of the Results
The approach demonstrates strong performance across multiple languages and benchmarks. Evaluation results show state-of-the-art performance on the QALD benchmarks, with the system outperforming existing QA systems, including ChatGPT.
# Quality of Writing
The writing is clear and well organised.
# Limitations
The system is currently limited by the need for manual creation of lexical entries and grammar templates, which can be time-consuming.

Review #2
Anonymous submitted on 20/Jun/2024
Suggestion:
Reject
Review Comment:

This paper thoroughly describes a template-based approach to multilingual question answering, which requires crafting lexical entries and sentence templates that capture different frames, i.e., syntactic structures. The authors demonstrate that the approach can be ported to other languages with hours of manual work performed by language experts who received minimal training/instruction. To create sentence templates, the language experts receive five sample questions for each lexical entry (such as a noun), together with syntactic categories and a dictionary with the corresponding set of words for each category. The language experts are required to manually craft four variations of each sentence template. As a result, each noun can be described by 20-24 manually constructed sentence templates (a sketch of what such templates could look like follows below).
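
To make this description concrete, the following is a hypothetical encoding of one such noun-based sentence template; the format, variants, and property names are illustrative assumptions, not reproduced from the paper.

```python
# Hypothetical sketch of a noun-based sentence template (the paper's
# actual template format is not reproduced in this review). Each lexical
# entry carries a handful of manually crafted surface variants plus the
# SPARQL pattern they parse into.
TEMPLATES = {
    "mayor": {
        # four manually crafted variations, as described above
        "variants": [
            "Who is the mayor of {entity}?",
            "What is the name of the mayor of {entity}?",
            "Who is {entity}'s mayor?",
            "Tell me the mayor of {entity}.",
        ],
        # <ENTITY> is a placeholder for the URI bound at parse time
        "sparql": "SELECT ?answer WHERE { <ENTITY> dbo:mayor ?answer . }",
    }
}

def instantiate(lemma: str, entity_label: str, entity_uri: str):
    """Yield (question, SPARQL query) pairs for one entity."""
    template = TEMPLATES[lemma]
    query = template["sparql"].replace("<ENTITY>", f"<{entity_uri}>")
    for variant in template["variants"]:
        yield variant.format(entity=entity_label), query

for question, query in instantiate(
    "mayor", "Berlin", "http://dbpedia.org/resource/Berlin"
):
    print(question, "->", query)
```
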

The authors show the merit of the proposed approach by evaluating it on five QALD datasets against previously proposed approaches, including machine-learning-based ones, and conclude that their approach outperforms all state-of-the-art approaches, including ChatGPT.

Strengths:
* The authors go into great detail explaining their approach and provide many examples to illustrate the main concepts.
* Adding results of the ChatGPT baseline is commendable, and its error analysis seems very insightful.

Weaknesses:
* The major concern with the reported results is that the authors use the test set to create their templates, a setting referred to as ‘incremental’ in the paper. This means their results are not comparable with those of machine-learning approaches that do not have access to the test set. Using the test set to build the model is considered a fundamental methodological flaw in machine learning, as it defeats the purpose of evaluation, namely measuring the ability of an approach to handle examples unseen during training.
* Moreover, comparing the results reported for the ‘incremental’ versus ‘inductive’ settings, it is very clear that the proposed template-based approach does not generalise and therefore cannot scale, as it requires the language experts to come up with new templates for every previously unseen question structure. See Table 8: on QALD-9, the proposed approach reaches F1=0.25 when not using the test set, which is below the majority of the other approaches evaluated on this test set.
* The authors claim to use ChatGPT in a one-shot scenario but actually perform zero-shot prompting, i.e., they give no examples in the prompt. This is clearly not a fair comparison to their approach, since even the human annotators were given five examples to create the sentence templates. The authors should provide at least some examples of questions and their correct SPARQL queries in the prompt (a sketch of such a prompt follows below), which is likely to boost the performance of ChatGPT on this task.
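
For illustration, a one-shot prompt along the lines suggested above could look as follows; the example question, query, and vocabulary prefixes are assumptions, not taken from the paper's experimental setup.

```python
# Sketch of a one-shot prompt: one worked (question, SPARQL) pair is
# shown to the model before the test question. The example pair is
# hypothetical and would be drawn from the training split, never from
# the test set.
ONE_SHOT_PROMPT = """\
Translate the question into a SPARQL query over DBpedia.

Example:
Question: Who is the mayor of Berlin?
SPARQL: SELECT ?a WHERE { dbr:Berlin dbo:mayor ?a . }

Question: {question}
SPARQL:"""

def build_prompt(question: str) -> str:
    # str.format would trip over the literal braces in the SPARQL
    # example, so the placeholder is substituted directly.
    return ONE_SHOT_PROMPT.replace("{question}", question)

print(build_prompt("Who is the mayor of Paris?"))
```
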

Minor issues:
* The authors highlight their system's result in Table 10, while WDAqua has higher performance on Italian and should be highlighted instead.
* In the introduction, the authors mention PGMs and LSTMs among the current state-of-the-art QA systems, which is clearly not the case. They also enumerate deep neural networks, transformers, and language models as if there were no relation among the three.

Review #3
Anonymous submitted on 11/Jul/2024
Suggestion:
Major Revision
Review Comment:

The paper presents a model-based approach for question answering (QA) over linked data, which is easily extensible. New syntactic constructions can be added to the system to efficiently handle new types of questions. This is achieved without extensive retraining, providing a more controlled and predictable outcome than traditional machine learning (ML) systems.
The system is designed to be portable across multiple languages. It has been successfully adapted to German, Spanish, and Italian, demonstrating its versatility and effectiveness beyond English. This adaptability is crucial in creating a more inclusive and accessible QA system.

The paper provides thorough incremental evaluation results, showing that the approach outperforms existing state-of-the-art ML-based and rule-based approaches. The evaluations were conducted on multiple editions of the QALD dataset, ensuring the robustness and reliability of the findings.

The model allows for quick and efficient addition of new question types. For example, adding a sentence template to support specific questions can significantly increase the system’s parsing capability within minutes, illustrating its practical usability and scalability.

On the downside, the system relies heavily on creating lexical entries and sentence templates, which requires linguistic expertise. This is a potential bottleneck in scaling the system to new languages or domains without sufficient linguistic resources.

The approach focuses primarily on structured data in RDF datasets. It does not address the challenges of dealing with unstructured or semi-structured data, which are common in real-world applications. This limitation restricts the system’s applicability to a broader range of data sources.
Finally, although the paper highlights the limitations of ML-based systems, it does not provide a comprehensive comparative analysis of specific ML systems that may have addressed these limitations to some extent. This is a gap in demonstrating the relative advantage of the proposed approach.