Review Comment:
This paper thoroughly describes a template-based approach to multilingual question answering that requires crafting lexical entries and sentence templates capturing different frames, i.e., syntactic structures. The authors demonstrate that the approach can be ported to other languages at the cost of hours of manual work by language experts who received minimal training/instruction. To create the sentence templates, the language experts receive five sample questions for each lexical entry (e.g., a noun), a set of syntactic categories, and a dictionary with the corresponding words for each category. The experts are then required to manually craft four variations of each sentence template. As a result, each noun is described by 20-24 manually constructed sentence templates.
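For concreteness, the template-creation setup as I understand it could be sketched roughly as follows; the `LexicalEntry` class, the example lemma, and the template strings are my own hypothetical illustrations, not the paper's actual representation.

```python
# Hypothetical illustration of the reviewed approach; names and
# structures are my own, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    """A lexical entry, e.g. a noun, with its sentence templates."""
    lemma: str                                             # e.g. "capital"
    category: str                                          # syntactic category
    sample_questions: list = field(default_factory=list)  # 5 per entry
    templates: list = field(default_factory=list)          # 4 variations each

entry = LexicalEntry(
    lemma="capital",
    category="NounPPFrame",
    sample_questions=["What is the capital of Germany?"],
    templates=[
        "What is the {noun} of {X}?",        # base template
        "Which city is the {noun} of {X}?",  # manually crafted variation
    ],
)
# With roughly 4 variations per template, each noun ends up with
# 20-24 manually constructed sentence templates.
```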
The authors demonstrate the merit of the proposed approach by evaluating it on five QALD datasets against previously proposed approaches, including machine-learning-based ones, and conclude that it outperforms all state-of-the-art approaches, including ChatGPT.
Strengths:
* The authors go into great detail explaining their approach and provide many examples to illustrate the main concepts.
* Including the ChatGPT baseline results is commendable, and its error analysis is very insightful.
Weaknesses:
* The major concern with the reported results is that the authors use the test set to create their templates (referred to as ‘incremental’ in the paper), which means their results are not comparable with those of the machine-learning approaches that do not have access to the test set. Using the test set to build the model is a fundamental methodological flaw in machine learning, since it defeats the purpose of evaluation, namely measuring the approach's ability to handle examples unseen during training.
* Moreover, comparing the results reported for the ‘incremental’ versus ‘inductive’ settings makes it very clear that the proposed template-based approach does not generalise and therefore cannot scale, as it requires the language experts to come up with new templates for every previously unseen sentence structure. See Table 8: on QALD-9, the proposed approach reaches F1=0.25 when not using the test set, which is below the majority of the other approaches evaluated on this test set.
* The authors claim to use ChatGPT in a one-shot scenario but actually perform zero-shot prompting, i.e., they do not give the system any examples in the prompt. This is clearly not a fair comparison with their approach, since even the human annotators were given 5 examples to create the sentence templates. The authors should provide at least a few example questions with their correct SPARQL queries in the prompt (see the sketch below), which is likely to boost ChatGPT's performance on this task.
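To illustrate, a genuinely one-shot prompt could look like the following sketch; the example question/query pair and the prompt wording are hypothetical and only show the intended prompt structure, not the authors' actual setup.

```python
# Hypothetical one-shot prompt for the ChatGPT baseline; the example
# question and SPARQL query are illustrative, not taken from the paper.
ONE_SHOT_PROMPT = """\
Translate the natural-language question into a SPARQL query over DBpedia.

Example:
Question: What is the capital of Germany?
SPARQL:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?capital WHERE {
  dbr:Germany dbo:capital ?capital .
}

Question: {question}
SPARQL:
"""

def build_prompt(question: str) -> str:
    # Insert the test question after the worked example, so the model
    # sees one demonstration (one-shot) rather than none (zero-shot).
    # Plain string replacement avoids clashing with the SPARQL braces.
    return ONE_SHOT_PROMPT.replace("{question}", question)

print(build_prompt("Who wrote The Name of the Rose?"))
```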
Minor issues:
* In Table 10, the authors highlight their own system's result even though WDAqua achieves higher performance on Italian and should be highlighted instead.
* In the introduction, the authors mention PGMs and LSTMs among the current state-of-the-art QA systems, which is clearly not the case. They also enumerate deep neural networks, transformers, and language models as if there were no relation among the three.