Review Comment:
Review of “Overcoming Challenges of Semantic Question Answering in the Semantic Web”
The paper "Overcoming Challenges of Semantic Question Answering in the Semantic Web" presents an interesting and exhaustive overview of the field of Semantic Question Answering through the discussion of 62 systems as they emerge from the 2010-2015 scientific literature. The rules adopted for the selection of systems and the choice of criteria used to survey their main aspects are clearly presented and provide a further contribution of the paper.
The paper in fact introduces early on the selection policy and the definitions/guidelines used to delimit the scope of the analysis in the SQA area. In this way, the paper provides a specific focus on Question Answering as the process of retrieving formalized information (RDF triples or structured relations) from knowledge repositories typical of the Semantic Web (SW).
The paper is valuable for its synthesis over a large and critical area of current SW research. It is mostly clear and its coverage is good.
In this review, I would like to further discuss three major issues related to the current version of the paper.
1. Coverage and validity of the adopted notion of SQA
2. Organisation of the paper, which embodies a structured overview of this very broad area
3. Paper impact, i.e. whether or not the paper achieves the aim stated in its title.
1. Coverage and Validity of SQA
In my view, by focusing ONLY on the retrieval of structured data, the adopted QA notion is consistent but risks not giving a complete account of a field that has much to do with the systematic integration of unstructured information. I refer, for example, to current work on topics such as textual similarity, paraphrasing and textual entailment, and their impact on the retrieval of passages or other pointwise information that does not assume any specific KB being available (with reference, i.e. gold, entity information or typed relations). This work is about semantic tasks dealing with text understanding and retrieval but is underrepresented here. An example is the PARALEX approach, correctly cited in the paper, which is firmly based on learning from text. However, the PARALEX approach is representative of a wider set of open-domain QA systems, such as the one presented in (Bordes et al., 2014). These make use of paraphrase learning methods that integrate linguistic generalization (e.g. neural embeddings) with knowledge graph biases. Neglecting this line of research (just because it does not directly rely on RDF-like resources nor employ explicit disambiguation steps that depend on some form of reasoning) can be seen as a limitation of this paper.
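To make this line of research concrete, the following is a highly simplified sketch (my own illustration, not the actual model of Bordes et al.): questions and candidate KB triples are mapped into a shared vector space and scored by a dot product. The toy vocabulary, triples and random vectors are placeholders for embeddings that would in practice be learned from paraphrase data and the knowledge graph.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Placeholder embeddings; in the real approach these are learned jointly
# from paraphrase corpora and the knowledge graph.
word_emb = {w: rng.normal(size=dim) for w in ["who", "directed", "avatar"]}
triple_emb = {
    ("Avatar", "director", "James_Cameron"): rng.normal(size=dim),
    ("Avatar", "starring", "Sam_Worthington"): rng.normal(size=dim),
}

def embed_question(tokens):
    """Bag-of-words question embedding: average of the word vectors."""
    return np.mean([word_emb[t] for t in tokens if t in word_emb], axis=0)

def score(tokens, triple):
    """Similarity between the question and a candidate KB triple."""
    return float(embed_question(tokens) @ triple_emb[triple])

question = "who directed avatar".split()
best = max(triple_emb, key=lambda t: score(question, t))
print(best)
```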
In general, the paper pays little attention to distinctions that characterize the approaches in terms of the type and nature of the employed inference algorithms. Generative and discriminative inductive methods as well as symbolic methods are discussed indistinctly and, when targeting a single phenomenon, mixed in the same sections. The choice is to survey SQA phenomena and detect the underlying (application) challenges rather than to focus on the involved functionalities, i.e. on one (or more) of their possible decompositions into a range of tasks (and thus methods to solve them).
As an outcome, no consensus seems to emerge around a general architecture for the QA process, so that no discussion of best practices for the individual subtasks can be traced.
It must be said that the area is very broad and its heterogeneous methods do not easily allow one to define, on the one hand, analogies among systems and best practices and, on the other hand, a reference architectural decomposition. However, the authors make no effort to go in such a direction.
I think that this does not help the paper achieve its goal of shedding more light on the field.
2. Paper organisation.
The core of the paper is the discussion of some specific challenges that SQA systems seem to face today. The different reference challenges defined and discussed in Section 5 are:
• Lexical Gap
• Ambiguity
• Multilingualism
• Complex Queries (labelled "Operators" in Table 5)
• Distributed Knowledge
• Procedural, Temporal or Spatial Questions
• Templates
For each of the above challenges, the main solutions provided by the surveyed SQA systems are first introduced, as an exemplification of the challenge. They are also used as triggers for a comparative discussion of the different proposed techniques. Then, in Section 6, a general analysis is provided as a way to detect trends and prospects of SQA research in the near-to-medium term.
I have to say that the selection of the challenges is not clearly motivated, as some of them seem to be poorly representative of the field (e.g. multilingualism, an issue covered by very few systems) and, on the other hand, some of them are vaguely defined and possibly cluster too large an area of research, whose comparative analysis is very complex.
Lexical Gap. The lexical gap issue, for example, covers too large a set of phenomena to be presented as a single challenge. In the introductory text, it is defined as the problem of mapping text tokens onto KB primitives:
"Each textual tokens in the question needs to be mapped to a Semantic Web-based individual, property, class or even higher level concept. Most natural language questions refer to concepts, which can be concrete (Barack Obama) as well as abstract (love, hate). Similarly, RDF resources, which are designed to represent concepts, are characterized by binary relationships with other resources and literals, forming a graph. However, natural language text is not graph-shaped but a sequence of characters which represent words or tokens, whose relations form a tree."
This problem is not a lexical problem, as the authors admit by mentioning the (ontological?) mismatch between RDF graphs and the syntagmatic nature of word graphs or parse trees: it is not clear here whether the authors refer to constituency-based grammatical approaches, which in fact employ parse trees over syntagmatic structures, e.g. complex noun phrases, or to dependency-based approaches, which represent grammar through binary relations among words, e.g. heads of complex phrases.
However, in synthesis, I see several independent issues clustered in this challenge:
1. The lexical mismatch between named entities (or linguistic labels for other more abstract concepts) and knowledge graph node names
2. The complexity in the interpretation of individual linguistic relations (as recognized at the level of grammatical representation of sentences/queries) in terms of semantic or conceptual relations, interoperable with the semantics of the targeted RDF KBs
3. The complexity of the overall matching between grammatical graphs and the knowledge graph, where all grammatical relations in the query interact with all the involved arcs in the knowledge graph and joint inferences are required (a small illustrative sketch follows the list).
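As an illustration of the three issues, consider the following minimal sketch (my own, using a hypothetical question and illustrative DBpedia-style identifiers, not taken from the paper): entity-level lexical matching (issue 1), relation mapping (issue 2) and their joint assembly into a structured query (issue 3) are logically distinct steps, even if trivialised here.

```python
question = "Who is the mayor of Berlin?"

# (1) entity-level lexical matching: surface form -> KB node (illustrative)
entity_lexicon = {"berlin": "dbr:Berlin"}

# (2) relation-level mapping: linguistic relation -> KB property (illustrative)
relation_lexicon = {"mayor of": "dbo:leader"}

def interpret(question: str) -> str:
    """Toy joint interpretation: combine the two mappings into one query."""
    q = question.lower().rstrip("?")
    entity = next(uri for form, uri in entity_lexicon.items() if form in q)
    relation = next(uri for phrase, uri in relation_lexicon.items() if phrase in q)
    # (3) joint inference is trivial here; in a real system the entity and
    # relation choices must be made jointly over many interacting candidates.
    return f"SELECT ?x WHERE {{ {entity} {relation} ?x . }}"

print(interpret(question))
# SELECT ?x WHERE { dbr:Berlin dbo:leader ?x . }
```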
In the proposed subfields, i.e. "Normalization and Similarity", "Automatic Query Expansion", "Pattern Libraries", "Entailment", "Document Retrieval Models for RDF Resources" and "Composite Approaches", solutions (e.g. "Query Expansion") are mixed with phenomena (e.g. "Entailment"), and algorithmic techniques (e.g. "Normalization") with modelling paradigms (e.g. "Similarity"); this is not helpful in developing a clear picture of the topic (i.e. the kind of challenge targeted) and of the surveyed contributions (i.e. the scientific framework in which the research can be organised).
Again, I think that an "architectural" approach that proceeds from a decomposition of the problem (e.g. LexicalMatching < SyntacticInterpretation < EntityMatching < SemanticRelationMapping < JointInterpretationOfEntitiesAndRelations) to the paradigms and methods for each step would have been clarifying. Notice how much this "Lexical Gap" challenge overlaps with the "Ambiguity" area.
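A minimal sketch of the suggested decomposition, seen as an explicit processing chain, is given below; the stage names follow my proposal above, every stage body is only a placeholder, and nothing here reflects any surveyed system's actual API. The point is that each stage becomes an independently replaceable and benchmarkable module.

```python
from typing import Callable, Dict, List

State = Dict[str, object]   # shared state passed along the chain

def lexical_matching(state: State) -> State:
    state["mention_candidates"] = []      # surface-level candidate lookups
    return state

def syntactic_interpretation(state: State) -> State:
    state["grammatical_relations"] = []   # parse of the question
    return state

def entity_matching(state: State) -> State:
    state["entities"] = []                # mention -> KB node decisions
    return state

def semantic_relation_mapping(state: State) -> State:
    state["properties"] = []              # grammatical relation -> KB property
    return state

def joint_interpretation(state: State) -> State:
    state["query"] = None                 # joint choice and query construction
    return state

PIPELINE: List[Callable[[State], State]] = [
    lexical_matching, syntactic_interpretation, entity_matching,
    semantic_relation_mapping, joint_interpretation,
]

state: State = {"question": "Who is the mayor of Berlin?"}
for stage in PIPELINE:                    # each stage is independently testable
    state = stage(state)
```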
As a general suggestion, I would thus reorganise the discussion or introduce it with a clear picture of the subtasks involved in each challenge. In any case, I would:
- Rename the "Lexical Gap" area by merging it with the "Ambiguity" one, given their gross overlap;
- Avoid explicitly naming sub-challenges (e.g. "Document Retrieval Models for RDF Resources") that are exemplified by only one system;
- Keep quite independent tasks (such as "Normalization" vs. "Similarity-based lexical matching") separate.
A possible renaming of the labels defined for the challenges is the following:
Current Challenge → Suggested labelling
Lexical Gap + Ambiguity → Semantic Interpretation
Multilingualism → (maybe not needed, see below)
Complex Queries → Question Expressivity
Distributed Knowledge → Knowledge Locality and Heterogeneity
Procedural, Temporal or Spatial Questions → Question Types and Retrieval Complexity
Templates → KB Query Formalism
Notice how all the proposed labels are phenomena/processes (e.g. Knowledge Locality and Heterogeneity, Semantic Interpretation) or methodologies/solutions (e.g. KB Query Formalism).
3. Paper impact
In light of the above observations, it is important to establish whether or not the paper achieves its aims. I feel that the paper has a strong potential to shed light on the area, but this potential is not fully realised in the present version. As for its coverage, the paper is very valuable, and for this reason I strongly think it should be accepted for publication.
My problem is that, given its current organisation (developed around systems and challenges), the paper does not fully clarify the computational trends in the area, such as:
- Which techniques are mostly useful?
- Which subtasks most affect the quality of the overall SQA chain?
- Which are the missing aspects, i.e.
(1) already studied subtasks for which accurate solutions do not yet exist, or
(2) challenges that have been underestimated, for which more work is needed?
On the contrary, several statements in Section 6 are not fully justified.
An example is the discussion about multilingualism:
"By this means, future research has to focus on language-independent SQA systems to lower the adoption effort. For instance, DBpedia [76] provides a knowledge base in more than 100 languages which could form the base of a next multilingual SQA system."
It is not clear why a language-independent SQA system is necessary. Is it truly necessary, or simply useful, given the resources available in different languages? The problem here is that it is not even clear what "language-independent SQA" means, given that no system seems to make clear use of a multilingual technique and no definition is given: is independence related to the language of the question or to the language in which the descriptions in the KB are expressed? This difference is deep, as the two cases target completely independent issues.
In synthesis, again, an architectural view would have been helpful in focusing the discussion on real (empirically verified) limitations of the current technology and providing some basis to sketch a roadmap for future research.
Summary Review
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The paper is very well suited as an introduction to the field, given its coverage and clarity of focus.
(2) How comprehensive and how balanced is the presentation and coverage.
The proposed material fails to capture some relevant aspects of current research in QA, especially as concerns QA over unstructured data, but this is a strongly motivated choice of the authors.
(3) Readability and clarity of the presentation.
The presentation is very clear with a good impact on readability.
I do not agree with some of the adopted definitions, which are misleading in my view, but a suitable renaming is proposed above.
(4) Importance of the covered material to the broader Semantic Web community.
The lack of an architectural view of the SQA process does not allow most of the discussion to be properly framed. A more process-oriented view would have been beneficial. The authors are requested to improve their manuscript by devising a general (and comprehensive) reference architecture of the SQA process and then discussing most of the current material according to such an organised view.
Pointwise observations
Page 5. “Answer presentation”.
Why is Answer Presentation the first aspect to be discussed (just after the review of existing systems, which does not give any structural view of their general workflow)?
I find it a secondary issue in the organization of a general view of the field and would postpone this discussion to later in the paper. After all, entity summarization (i.e. Cheng et al. [22] ….) IS NOT verbalization of RDF triples (i.e. Ngonga Ngomo et al.), and the cluster is not entirely justified.
Page 5. “Thus, a new research sub field focusses on question answering frameworks, i.e., frameworks to combine different SQA systems”
The notion of framework intended here is not clear. I see that it is not just a software framework, but mostly a methodological framework, used to make different tools compatible within a unified QA architecture. This notion should be discussed at length by outlining typical examples of components (or subsystems) to be reused, whose integration requires a common framework. An independent subsection is likely required here.
Page 6. “Lexical Gap”
I do not think it is just a lexical gap, but it is more precisely a gap between linguistic knowledge and the target encyclopaedic knowledge of an RDF repository.
After all, a sense is the outcome of the entire information expressed in a sentence and thus the outcome of interpretation. It is not just a dictionary phenomenon, i.e. information missing from the lexicon.
Page 6. “Normalization & Similarity”
This perhaps means "Advanced candidate matching techniques": normalization (as well as lemmatization) and fuzzy matching are all methods to optimize candidate identification for SQA at the lexical level. But similarity, as a phenomenon that depends on the language or on the theory behind a KB, is NOT a process/technique like "Normalization".
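To illustrate the distinction, here is a minimal sketch (my own, with an invented label list and threshold) in which normalization is a deterministic preprocessing technique, while similarity is merely a score that a matching strategy consumes; the two belong to different levels of description.

```python
from difflib import SequenceMatcher

kb_labels = ["Barack Obama", "Barack Obama Sr.", "Michelle Obama"]   # illustrative

def normalize(text: str) -> str:
    """Normalization: a deterministic preprocessing step (a technique)."""
    return " ".join(text.lower().strip().split())

def candidates(mention: str, threshold: float = 0.75) -> list:
    """Fuzzy matching: rank KB labels by a string-similarity score."""
    m = normalize(mention)
    scored = [(SequenceMatcher(None, m, normalize(l)).ratio(), l) for l in kb_labels]
    return [label for score, label in sorted(scored, reverse=True) if score >= threshold]

print(candidates("barack  obama"))   # ['Barack Obama', 'Barack Obama Sr.']
```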
Page 7. “Patterns”
The knowledge patterns involved here can refer to RDF triples or to textual structures. In the latter case I would call them linguistic patterns; "knowledge patterns" would be more precise in the former. It seems to me that the first choice is better here. Notice that patterns are also used to infer the ranking of candidate answers and should thus also be linked to Section 5.2.
Page 7. “Entailment”
I am not sure this is a good choice, as entailment is also a logical property between formulas (i.e. knowledge subgraphs). Notice that entailment between texts (e.g. question-answer pairs) is the focus of a large area of research called Textual Entailment Recognition, and the term can lead to confusion here.
Page 8. “Ambiguity”
Here it seems that two kinds of ambiguity are discussed: sentence ambiguity, that is, ambiguity in the linguistic interpretation of the question, as well as ambiguity in the matching of entities as answer candidates. This should be better clarified, as the different subsections (Semantic vs. Statistical Disambiguation) in fact deal with both problems interchangeably.
Page 9. Semantic Disambiguation. “While statistical disambiguation works on the phrase level, semantic disambiguation works on the concept level.”
Quite critical distinction...
Semantics is always the OUTCOME of a disambiguation process, so that disambiguation always proceeds at the concept level. Statistical methods are just OFTEN tied only to lexical information and its distributional behaviour, making explicit use of these properties within a quantitative inference model. Semantic methods are more often tied only to dictionary senses or KB entity information, but (1) most methods work in a hybrid manner, and (2) at both levels quantitative as well as algorithmic approaches are usually applied. Graph-based models (e.g. random surfer approaches) applied to knowledge graphs are always statistical in nature.
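To make the last point concrete, the following toy sketch (my own illustration; the candidate graph and damping factor are invented) applies a random-surfer iteration to a small candidate-entity graph: it operates entirely over knowledge-graph structure, yet the scores it produces are statistical in nature.

```python
# Toy random-surfer scoring over a small candidate-entity graph.
graph = {
    "Paris_(France)": ["France", "Eiffel_Tower"],
    "Paris_(Texas)":  ["Texas"],
    "France":         ["Paris_(France)"],
    "Eiffel_Tower":   ["Paris_(France)", "France"],
    "Texas":          ["Paris_(Texas)"],
}

def random_surfer(graph, damping=0.85, iters=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, out in graph.items():
            for m in out:                      # distribute rank along edges
                new[m] += damping * rank[n] / len(out)
        rank = new
    return rank

scores = random_surfer(graph)
# Candidates better connected to the other question candidates score higher,
# e.g. Paris_(France) beats Paris_(Texas) when "Eiffel Tower" is also mentioned.
print(max(scores, key=scores.get))
```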
Page 10. “Alternative approaches”
These seem to be mostly approaches where natural language processing is not applied, so that a sort of controlled language is adopted for querying. The title does not capture this aspect explicitly.
Page 11. “GETARUNS”
The original reference to GETARUNS should be
Delmonte, R. (2008). Computational Linguistic Text Processing - Lexicon, Grammar, Parsing and Anaphora Resolution. New York: Nova Science Publishers. ISBN: 978-1-60456-749-6.
Page 11. Footnote 20: Such as "List the Semantic Web people and their affiliation."
If you decide that an explanation of the notion of coreference is needed, then you should be more explicit here, mentioning the coreferent "their" and the referred entity ... "people".
Page 11. “… handling procedural questions ….”
Is this still QA under your initial assumption? Why is this case different from complex but plain document-oriented retrieval?
Page 11. “… statistic distribution…”
… statistical distribution …
Page 12. “Xu et al [12] …”
You need to also introduce the Xser acronym of the corresponding system here, as it is referred to afterwards in the text. I think that a reference to the system rather than to the authors is better here.
Page 14. Conclusion
In the text
“Future research should be directed at more modularization, automatic reuse, self-wiring and encapsulated modules with their own benchmarks and evaluations. Thus, novel research field can be tackled by reusing already existing parts and focusing on the research core problem itself.”
This is exactly what the overview of Sections 4 and 5 does not allow one to define: reusable modules and parts are never defined, even in Section 6.
“Another research direction are SQA systems as aggregators or framework for other systems or algorithms to benefit of the set of existing approaches.”
The space dedicated to outlining a potential unifying framework is negligible. Since it is reported here as a future research direction, it should be more carefully traced.
“Furthermore, benchmarking will move to single algorithmic modules instead of benchmarking a system as a whole.“
The target of local optimization is to benchmark a process at its individual steps, but global benchmarking is still needed to measure the impact of error propagation across the chain. A Turing-test-like spirit would suggest that the latter is more important, as local measures are never fully representative.
References.
A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In Proceedings of ECML-PKDD 2014. Springer, 2014.