Review Comment:
The authors introduce the dataset MQALD, which consists of SPARQL queries with modifiers, their results, and their verbalisations. As most QA systems struggle with SPARQL queries that contain modifiers, this is a relevant dataset for evaluating the expressiveness of QA systems. Part of the dataset comes from the existing QALD challenges, complemented by 100 newly created queries. I appreciate the author comments and the changes that were made for the revised version. However, I am still not fully convinced by some of the answers, and some of the corresponding updates in the paper raise new questions:
- Analysis of QA systems: I stand by my comment that I would expect a prior analysis of which systems consider which types of modifiers. The authors replied that "The systems considered for the evaluation do not address explicitly the problem of modifiers but are still capable of answering questions with modifiers", which is exactly the issue I am interested in: how can a system not consider modifiers but still answer questions with modifiers? This is later answered via examples (e.g., concerning the "UNION" modifier), but it would be better to analyse the systems beforehand. For example, TeBaQA seems to extract templates from the QALD training queries, so it should be capable of handling modifiers (when they are included in the training queries). gAnswer, on the other hand, seems to create a subgraph that does not include modifiers (except for "count"?).
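To make this concrete, consider a hypothetical question in the style of the dataset whose correct translation requires the solution modifiers ORDER BY and LIMIT; a system that only constructs a basic graph pattern would return all children instead of the oldest one (question and query are my own illustration, not taken from MQALD):

  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>

  # "Who is the oldest child of Barack Obama?"
  SELECT DISTINCT ?child WHERE {
    dbr:Barack_Obama dbo:child ?child .
    ?child dbo:birthDate ?birth .
  }
  ORDER BY ASC(?birth)   # modifier: sort by birth date, earliest first
  LIMIT 1                # modifier: keep only the oldest child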
The authors (still) state that "Unfortunately, only the QAnswer API allows obtaining the SPARQL translation of the question created by the system.". According to the API descriptions, both TeBaQA (https://github.com/dice-group/TeBaQA, see the QuestionAnsweringController) and gAnswer (https://github.com/pkumod/gAnswer, see 'sparql":["select DISTINCT ?wife where '...) do return the SPARQL query as part of their results. Therefore, I am a bit puzzled by the authors' reply. However, while writing this review, the public APIs of these two systems are down, so I cannot investigate this issue further.
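For reference, the gAnswer fragment quoted above suggests a JSON response roughly of the following shape; everything except the "sparql" field is my reconstruction and may well differ from the actual API output:

  {
    "question": "Who is the wife of Donald Trump?",
    "sparql": [
      "select DISTINCT ?wife where { <Donald_Trump> <spouse> ?wife . }"
    ]
  }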
- GERBIL integration: In the revised paper and the author comments, the authors write that they are working on the inclusion of MQALD in GERBIL. This is good, but I cannot take promises into account for my review. I hope this issue can be fixed. Also, footnote 17 in the revised paper still reads a bit oddly.
- Dataset availability and usage statistics: 16 unique downloads (presumably including the authors and reviewers) do not demonstrate the "Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided", as required by the dataset article review criteria.
- Query creation: The process of query creation is described in more detail in the revised version. I still have questions about the process, though, specifically about the sentence "choosing random resources of DBpedia as a seed of the question or exploiting question templates". If the selection of entities was random, why are 6 out of 100 queries about Elvis Presley? And if these were random resources, how come they are all very prominent entities (Elvis Presley, Barack Obama, Marlon Brando, ...)? How were the query templates selected, and why did this lead to such a non-uniform distribution of modifiers, as shown in Table 1 (e.g., 41 of the queries use a "FILTER" and 0 use "NOW")?
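For context, FILTER is the natural translation target for any numeric or date comparison, which may partly explain its frequency; the following hypothetical query in the style of the dataset is my own illustration, assuming the DBpedia properties dbo:starring and dbo:releaseDate:

  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>

  # "Which films starring Marlon Brando were released after 1980?"
  SELECT DISTINCT ?film WHERE {
    ?film dbo:starring dbr:Marlon_Brando ;
          dbo:releaseDate ?date .
    FILTER (YEAR(?date) > 1980)   # FILTER modifier: restrict by date
  }

A NOW-based variant would simply replace the comparison, e.g. FILTER (?date < NOW()), so it is surprising that no template produced such a query.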
- Versioning: According to the Zenodo repository, all versions of the MQALD dataset were published on May 21, 2020, which is not the case. Proper publication dates for the different versions would make the updates more transparent. Also, in Section 3.4, the authors say "We published the dataset in May 2020". This only refers to the first version of the dataset and should be made clear.
- Dataset: The questions extracted from QALD in the "MQALD.json" dataset are missing a value for "qald-version". Also, "hybrid" is always false and never described (however, I understand that this comes from the QALD JSON structure).
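For readers unfamiliar with the format, an entry looks roughly as follows; this is simplified and reconstructed from memory, with the sparql string shortened, and serves only to illustrate the two fields in question:

  {
    "id": "42",
    "qald-version": "",
    "hybrid": "false",
    "question": [
      { "language": "en", "string": "Who is the wife of Barack Obama?" }
    ],
    "query": {
      "sparql": "SELECT DISTINCT ?uri WHERE { dbr:Barack_Obama dbo:spouse ?uri . }"
    },
    "answers": []
  }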