MQALD: Evaluating the impact of modifiers in Question Answering over Knowledge Graphs

Tracking #: 2701-3915

Authors: 
Lucia Siciliani
Pierpaolo Basile
Pasquale Lops
Giovanni Semeraro

Responsible editor: 
Harald Sack

Submission type: 
Dataset Description
Abstract: 
Question Answering (QA) over Knowledge Graphs (KG) aims to develop systems capable of answering users' questions using information from one or more Knowledge Graphs, such as DBpedia or Wikidata. A QA system needs to translate the user's question, expressed in natural language, into a query formulated in a specific data query language compatible with the underlying KG. This translation process is already non-trivial for simple questions that involve a single triple pattern, and it becomes even more challenging for questions that require modifiers in the final query, e.g., aggregate functions or query forms. Attention to this last aspect is growing, but it has never been thoroughly addressed in the existing literature. Building on the latest advances in this field, we take a further step in this direction. This work provides a publicly available dataset designed for evaluating the performance of a QA system in translating articulated questions into a specific data query language. The dataset has also been used to evaluate three state-of-the-art QA systems.
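As an illustrative sketch (not a question taken from the dataset itself), a question such as "Who are the three tallest basketball players?" cannot be answered with a single triple pattern: the corresponding SPARQL query over DBpedia also needs the ORDER BY and LIMIT modifiers, assuming the dbo:BasketballPlayer class and the dbo:height property.

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?player
WHERE {
  # match basketball players together with their height
  ?player a dbo:BasketballPlayer ;
          dbo:height ?height .
}
# modifiers: sort by descending height and keep only the top three results
ORDER BY DESC(?height)
LIMIT 3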
Tags: Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Simon Gottschalk submitted on 21/Feb/2021
Suggestion:
Minor Revision
Review Comment:

The authors introduce the dataset MQALD, which consists of SPARQL queries with modifiers, together with their results and verbalisations. As most QA systems struggle with SPARQL queries that contain modifiers, this is a relevant dataset for evaluating the expressiveness of QA systems. Part of the dataset comes from the existing QALD challenges, complemented by 100 newly created queries. I appreciate the author comments and the changes that were made for the revised version. However, I am still not totally convinced by some of the answers, and some of the corresponding updates in the paper lead to new questions:

- Analysis of QA systems: I stand by my comment that I would expect an analysis beforehand of which systems consider which types of modifiers. The authors replied that "The systems considered for the evaluation do not address explicitly the problem of modifiers but are still capable of answering questions with modifiers", which is exactly the issue I am interested in: How can a system not consider modifiers but still answer questions with modifiers? This is later answered via examples (e.g., concerning the "UNION" modifier), but it would be better to analyse the systems beforehand: For example, TeBaQA seems to extract templates from the QALD training queries, so it should be capable of handling modifiers (when included in the training queries). gAnswer, on the other hand, seems to create a subgraph that does not include modifiers (except for "count"?).
The authors (still) state that "Unfortunately, only the QAnswer API allows obtaining the SPARQL translation of the question created by the system.". According to the API descriptions, both TeBaQA (https://github.com/dice-group/TeBaQA, see the QuestionAnsweringController) and gAnswer (https://github.com/pkumod/gAnswer, see 'sparql":["select DISTINCT ?wife where '...) do return the SPARQL query as part of the results. Therefore, I am a bit puzzled by the authors' reply. However, while writing this review, the public APIs of these two systems are down, so I cannot investigate this issue further.

- GERBIL integration: In the revised paper and the author comments, the authors write that they are working on the inclusion of MQALD to GERBIL. This is good, but I cannot take promises into account for my review. I hope this issue can be fixed. Also, footnote 17 in the revised paper still sounds a bit odd.

- Dataset availability and usage statistics: 16 unique downloads (presumably including authors and reviewers) do not justify the "Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided", as required in the dataset article review criteria.

- Query creation: The process of query creation is described in more detail in the revised version. I still have questions about the process, though, specifically about the sentence "choosing random resources of DBpedia as a seed of the question or exploiting question templates". If the selection of entities was random, why are 6 out of 100 queries about Elvis Presley? If these were random resources, how come they are all very prominent entities (Elvis Presley, Barack Obama, Marlon Brando, ...)? How were the query templates selected, and why did this lead to such a non-uniform distribution of modifiers, as shown in Table 1 (e.g., 41 of the queries use a "FILTER" and 0 use "NOW")?

- Versioning: According to the Zenodo repository, all versions of the MQALD dataset have been published on May 21, 2020, which is not the case. Proper publication dates for the different versions would make the updates more transparent. Also, in Section 3.4, the authors say "We published the dataset in May 2020". This only refers to the first version of the dataset and should be made clear.

- Dataset: The questions extracted from QALD in the "MQALD.json" dataset are missing a value for "qald-version". Also, "hybrid" is always false and never described (however, I understand that this comes from the QALD JSON structure).

Review #2
By Ricardo Usbeck submitted on 25/Feb/2021
Suggestion:
Accept
Review Comment:

# Summary/Description

The revised article describes MQALD, version 3 (29/10/2021) of a modified QALD dataset and a newly generated dataset that focuses on SPARQL operation modifiers.
Thanks to the authors for answering our questions in the cover letter. We also want to acknowledge the excellent work of the other two reviewers.

# Short facts

Name: MQALD
URL: https://zenodo.org/record/4479876, https://github.com/lsiciliani/MQALD (updated January 29, 2021)
Version date and number: 3.0, May 21, 2020
Licensing: MIT
Availability: guaranteed
Topic coverage: not applicable
Source for the data: The existing QALD - benchmark series https://github.com/ag-sc/QALD
Purpose and method of creation and maintenance: By extracting SPARQL queries containing modifiers and adding 100 novel questions.
Reported usage: From Zenodo - 75 views and 23 downloads at the time of review (24.02.2021)
Metrics and statistics on external and internal connectivity: None.
Use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF): QALD JSON plus extension
Language expressivity: English, Italian, French, Spanish.
Growth: Small, based on community feedback.
5-star data?: no.

# Quality and stability of the dataset - evidence must be provided

Given the restructured Section 3, the description quality has improved. It also opens future research questions, which is beneficial to the community. The dataset seems stable, given its availability via a PID.

However, I do not see an open issue at GERBIL QA (https://github.com/dice-group/gerbil/issues?q=is%3Aissue+is%3Aopen+mqald) for integrating MQALD. Nevertheless, I trust the authors to upload it there as soon as possible.

The dataset is still relatively small but highly diverse, and thus arguably valid for testing, though not for training, a KGQA system.

# Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.

The dataset has already proven its usefulness for the KGQA community by evaluating three SOTA systems thoroughly.

# Clarity and completeness of the descriptions.

The paper is well-written, and the description is clear, which enables replication.

# Overall impression and Open Questions

MQALD can become a cornerstone for future research. Its description is helpful and will bring the KGQA community forward.

The only open question arises on page 5: "KG independent". Could you please elaborate on this point and give an example?

# Minor issues

P1, left l 49, is not a grammatically correct sentence.
P2, left l 5, not a grammatically correct sentence.
P3, left l 4, there is an "only" missing.
P4, right l3, "into" => "in".
P12, right l 49, strange newline.

Check all references to ensure that arXiv references such as [19] are also available as peer-reviewed versions; see also https://github.com/yuchenlin/rebiber

Review #3
Anonymous submitted on 27/Feb/2021
Suggestion:
Accept
Review Comment:

Thanks for your comments and clarification. The paper now looks in good shape. However, I still have one major concern about the annotation part, detailed below.

Major Issue:
———————
Annotation plays a vital role in this work. After the annotation is done, it is also crucial to evaluate the quality of the annotated data. In case of acceptance, I would expect more clarification from the authors about the evaluation process of the annotated data. Have you evaluated the quality of the annotated data? If so, how did you evaluate it? Did you use any metric or measurement? And what is the inter-annotator agreement score (Cohen's kappa [1])?
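For reference, Cohen's kappa [1] between two annotators can be computed as

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the observed proportion of agreement between the annotators and p_e is the proportion of agreement expected by chance; values close to 1 indicate strong inter-annotator agreement.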

Minor Issue:
———————
- Although Listing 2 is slightly improved in this version of the submission, it would make more sense if the listing spanned two columns so that the opening and closing braces are easily distinguishable, with proper indentation, which would make the listing more readable.
- On page 9, the last equation F(q) is not properly aligned with the previous one.

References:
1. https://en.wikipedia.org/wiki/Cohen%27s_kappa