Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review

Tracking #: 3060-4274

Kyle Hamilton
Aparna Nayak
Bojan Božić
Luca Longo

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Survey Article
Advocates for Neuro-Symbolic AI (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP) would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, discuss challenges and future directions, and aim to answer the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that knowledge encoded in relational structures and explicit reasoning tend to lead to more NeSy goals being satisfied. We also advocate for a more methodical approach to the application of theories of reasoning, which we hope can reduce some of the friction between the symbolic and sub-symbolic schools of AI.

Major Revision

Solicited Reviews:
Review #1
By Vered Shwartz submitted on 15/Mar/2022
Review Comment:

This manuscript presents a meta analysis of neuro-symbolic methods in NLP, with the goal of determining whether and to what extent such methods fulfill their intended goals. The goals discussed are out-of-distribution generalization, interpretability, reduced amount of training data, transferability to new tasks, and reasoning. The article analyzes the papers from different perspectives including the task, type of learning, format of symbolic knowledge, combination method, etc. The conclusion is that there is little work on neuro-symbolic methods in NLP, most of which incorporate symbolic knowledge in a shallow way (i.e. embedding it into the network), and that most methods don’t deliver results on these 5 goals. However, the few models that combine the neural and symbolic components in more interconnected ways perform better on most goals.

This was a very interesting read and I appreciate the formalization of terminology in neuro-symbolic NLP. The findings were along the lines of what I intuitively expected, so it’s nice to have this manuscript provide empirical evidence for this intuition. The main limitations of this paper are the small number of papers analyzed, and the fact that it sometimes fails to analyze the findings, presenting the data as is (see details below).

Dimensions of survey papers:

(1) Suitability as an introductory text - the manuscript was easy for me to read but for the sake of a reader knowing little about neuro-symbolic methods, it may be worth looking in depth into 2-3 papers from the analysis, including a figure showing the task, symbolic knowledge, neural component, and combination type.

(2) How comprehensive and how balanced is the presentation and coverage - unfortunately, a small number of papers was analyzed. I don’t know if this is because the selection criteria were too restrictive or because there is indeed very little work on neuro-symbolic NLP. I believe it’s mostly due to the former.

(3) Readability and clarity of the presentation - overall was very good. The only comment I have is that some of the graphs and tables provide raw data which the text doesn’t analyze in depth. It would be good to note, even in a few sentences, if there is no signal or finding in a particular experiment.

(4) Importance of the covered material to the broader Semantic Web community - neuro-symbolic NLP may have implications for reasoning over ontologies and knowledge graphs and information retrieval across the web.

(5) Supplementary code and data - looks complete and organized.

Specific comments:

1. Page 2, line 34 - it may be worth mentioning that Kahneman himself sees this as a misunderstanding of the systems, as he explained in the Montreal AI Debate 2020.

2. Page 4, reduced size of training data - the distinction between pre-training data and fine-tuning data is important (even if the argument holds for both).

3. Section 2 - please include in the appendix the list of venues. In particular, I looked at the supplementary code and couldn’t find many mentions of ACL. I was surprised by the ratio of conferences and journals, since most NLP work is published in conferences.

5. Section 3.1.1 - why is sentiment analysis separate from text classification? Isn’t it a form of text classification?

6. Section 3.1.1 - isn’t KB completion by definition neuro-symbolic (assuming the model is neural)? I.e. the training data is symbolic.

7. Section 3.1.1 - If I understand correctly, linguistic structure is considered as a symbolic element. Did you search for specific terms to find works incorporating linguistic structure? I think it would be interesting to elaborate on that line of work as opposed to most (I assume) other works that rely on knowledge graphs etc.

8. Section 3.1.1 - in the last paragraph, the discussion about supervised neural models gaming tasks should include citations to relevant papers. For example, about image captioning [Szegedy et al., 2015], visual question answering [Agrawal et al., 2016], reading comprehension [Jia and Liang, 2017], and natural language inference [Poliak et al. 2018; Gururangan et al. 2018].

9. Page 15, line 26 - what about the transformer architecture?

Minor comments and typos:

- Page 2, line 44 - a lot intuitive sense -> a lot of intuitive sense
- Page 3, line 21 - missing word “hand”
- Page 6, line 48 - one -> uni
- Fig 6 (a) is missing
- Page 9, line 44 - one-to-one -> many-to-one
- Table 4 appears too early, a few pages before it is referred to
- Figure 18 - are the columns the proposed terms?
- Page 18, line 14 - discussed Section -> discussed in Section
- Page 18, line 19 - missing closing brackets
- Page 23, line 11 - agree -> agree on


[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[2] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1203.

[3] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1215.

[4] Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/S18-2023.

[5] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2017.

Review #2
Anonymous submitted on 20/Mar/2022
Major Revision
Review Comment:

** General Review

I think this paper is interesting and very readable and the goal of this paper is important. However, I think the paper does not completely address the problem: sometimes it is too general in the way it describes the research and I also think it ignores many papers in the literature. I suggest the authors improve this work because I think the answer to this question (i.e., the contribution of neuro-symbolic to language) is very important.

I will give a general summary of this paper using the template provided for the review:

(1) Suitability as an introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The text is currently not an introductory text for neuro-symbolic approaches in the context of language. In general, some of the concepts are quickly introduced and not explained in detail (e.g., Universal Grammar is a very specific concept that not all researchers might know).

I understand that a complete survey is not the focus of this paper, but I still think that a more detailed introduction to the neuro vs symbolic divide, in the context of language, would help in making this review more suitable as an introductory text.

There is a wide range of work that can be mentioned and compared: much of the original natural language inference work was symbolic/logic-based.

Recent papers, in contrast, tend to focus on the use of transformer-based deep neural networks (e.g., all the experiments on the SNLI, MNLI, and ANLI datasets).

Some papers that combine neural and symbolic approaches are also present in the literature.

(2) How comprehensive and how balanced is the presentation and coverage.

The paper is not comprehensive if the aim is to answer questions about neuro-symbolic approaches and language. The paper does not provide many references from the linguistics community. Moreover, the works are not described in detail, and it is sometimes difficult to get an overall picture of the contributions currently described in the literature. All in all, it is difficult to get an idea of the state of the art of these technologies and of exactly what they are used for (albeit I understand that many of these papers use different datasets).

(3) Readability and clarity of the presentation.

The paper is generally clear and readable.

(4) Importance of the covered material to the broader Semantic Web community.

The work is relevant to the community as it also targets the area of knowledge representation. Some of the works described in this paper use semantic web technologies.

** Some more detailed comments:

I truly appreciate the effort that has been put into designing this work. I appreciate the structural organization and the plots and I think they provide an interesting overview of what has been done in the literature.

As already mentioned in the introduction of this review, my main comment against this paper is that it does not answer its main claim. The authors include very few papers about natural language processing, and I cannot find any paper from the ACL community (which I consider the main community for NLP work). The issue is also clear from the TF-IDF word clouds presented in Fig 6, since the words "text" and "language" do not come up in the abstracts. Moreover, other important works that combine language, symbols, and neural networks are missing from this presentation. In general, I am a bit surprised that no paper from the NLP conferences was included in this process. Searching "neuro-symbolic" in the ACL Anthology returns many hits.

The discussion of the related work is very broad and general. The Neuro-Symbolic Concept Layer is mentioned but never explained; I would find it hard to understand if I didn't already know what it was. This would probably be fine if this paper weren't submitted as a survey paper. However, I think that for a survey paper more details are needed.

** Format

Page 4: I am not sure why the authors have a simple bullet point list at the start and then have a subsection for each element. I think the authors can just start with the subsections.

Please check the references; some of them do not contain enough information (e.g., a missing conference name).

Review #3
By Filip Ilievski submitted on 30/Mar/2022
Major Revision
Review Comment:

This paper asks whether neuro-symbolic methods provide the promised improvements in AI and ML, specifically out-of-domain generalization, transferability, reasoning, use of less training data, and interpretability. This question is very timely, and, if addressed in a scientific document (a paper or a book), it would be of greatest service to the community.

The paper lists five research questions that investigate the link between NeSy tasks, benchmarks, methods, and applications. The method is based on finding a pool of publications in Scopus through keywords, selecting relevant and high-quality publications, and analysing them. The paper includes a discussion on the future directions and the limitations of the study.

While I appreciate the overarching question posed in the title and the abstract of this paper, I find a great discrepancy between this question and the questions actually investigated in the paper. The abstract claims that "more goals are being satisfied" by NeSy methods, yet, I am not sure how this finding is supported. The findings provided in the paper are about what is popular in NeSy work, and what promises are being made. I did not find evidence that such promises are (or are not) fulfilled.

This could be a matter of scoping. The title together with the five dimensions corresponds to a very large body of research, especially in the last several years. Providing a satisfactory answer to whether NeSy systems improve on, say, reasoning alone would require a much more thorough investigation. In the current shape, the paper merely shows the number of papers that work on these dimensions in relation to the considered task, application and method.

Another manifestation of the scoping problem lies in the small number of works analyzed (34). It is very likely that only in 2021 there have been more papers published on NeSy systems that focus on the five dimensions. A sample of 34 papers from any time cannot be expected to represent the entire body of research on this topic to a reasonable extent.

Another issue with this manuscript is that it is often colloquial about prior work. It does not include a related work section that analyses prior NeSy surveys. Prior surveys are mentioned, but it is unclear whether their focus is similar to that of the present manuscript. Moreover, the paper has many occurrences of "X said Y" without a citation. Stating that "We don't need a scholar to confirm ..." is also problematic: a scientific paper should provide support rather than assuming general knowledge about logic and common sense.

In 1.3 - are the category definitions devised by the authors or do they come from prior work? If the latter, these would need citations. If the former, this should be explicitly stated and justified. Similarly for the three XAI categories.

In table 1, how are the keywords compiled? They seem to miss very important recent keywords, like 'language models', 'knowledge graphs', 'self-supervision', 'knowledge augmentation', etc. I understand that a list will never be complete, but still the authors should explain how this keyword list was generated.

Time plays a key role for this paper. Asking for the 'current trends' might entail that the publication date should be limited to the last several years. Methods and tasks from 2011 are very likely to be outdated today.

The paper provides many figures, which is nice, but the explanation of the figures is minimal, and often not available at all. As these figures are crucial for the questions investigated in the paper, the authors must discuss them in more detail, and connect them clearly to the research questions. In fact, the paper does not seem to answer the question in the title, nor the five questions in section 2.1.

Minor comments:
* please add a Y-axis to figure 2
* the annotation task is critical for this paper, and I am not convinced that two annotators with low agreement provide a good basis to perform such an ambitious analysis. This seems like a key limitation of this research.
* Fig 9 - should it be 'NLU' instead of 'NLP' in the caption?
* Another relevant survey that the authors might have missed:

Van Harmelen, F., & Teije, A. T. (2019). A boxology of design patterns for hybrid learning and reasoning systems. arXiv preprint arXiv:1905.12389.