Review Comment:
# Summary
This survey summarises research papers related to the integration of external knowledge into LLMs. The surveyed papers were retrieved via a keyword search, but the methodology is not described in detail. The authors motivate the need for knowledge injection by focusing on hallucinations, outdated knowledge and domain-specific knowledge. The identified work is divided into parametric and non-parametric adaptations of LLMs. For parametric approaches, parameter-efficient fine-tuning (PEFT) methods are discussed. Additional methods are mentioned, but not clearly distinguished from PEFT. Non-parametric approaches are limited to brief descriptions of KG integration and retrieval-augmented generation (RAG) methods.
# Strengths
A comprehensive survey on the broad topic of knowledge integration into LLMs is highly relevant. A good introductory text that structures and organises recent work, which is abundant in this area, would be valuable.
# Weaknesses
This survey is not comprehensive and overlooks several key areas of integrating external knowledge, such as tool calling, reinforcement learning, or even simple prompting. The literature selection relies solely on a keyword search based on a "well-defined taxonomy", which, judging from a look at the actual keywords, appears to be an arbitrary selection of keywords that look interesting; the paper does not provide details on how they were selected. Additionally, this work contains several factual mistakes, is poorly structured (and as a result, repeats itself constantly), and is overall not well-written.
# Review dimensions
* The work is not suitable as an introductory text due to its lack of clear structure, factual inaccuracies, frequent repetitions and overall poor language. For the intended audience of practitioners or developers, this survey fails to provide clear insights to guide the implementation of specific systems. For a research-based audience, the provided descriptions of particular methods are too generic.
* The survey misses several well-known methods for integrating knowledge into LLMs, e.g., tool usage/function calling, knowledge distillation, mixture-of-experts models, reinforcement learning, memory systems, or even simple prompting. The discussion of specific methods is limited to high-level descriptions without reference to related experiments, and the comparison between methods is not based on empirical data.
* Many repetitions, grammatical errors and a lack of focus on the discussed topic make this paper particularly difficult to read. The provided figures do not improve the clarity.
* The basic topic of knowledge integration is relevant for the NLP and broader AI community. In particular, the brief discussion of non-parametric integration of KGs is relevant to the broader Semantic Web community. In general, a comprehensive survey would be of great value. However, this paper only covers a small amount of work relevant to this topic.
* The provided resource is a single Python script, which uses the APIs of several repositories and search engines for scientific literature to search for specific predefined terms. The results of this search are not provided, and it is unlikely that the script can reproduce the same results used for the survey. The script does not use a fixed date range for the search (it only considers the last 5 years, so the search results will already differ next year). Additionally, the date filter is only partially applied: it is ignored for DBLP and Google Scholar, and the year 2021 is hard-coded for IEEE Xplore. The script is published on Zenodo, but it does not include a README, and no data is provided.
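To illustrate the reproducibility concern in the last point, here is a minimal sketch of a fixed search window that is applied uniformly to every source. It is not the authors' script: the window `SEARCH_WINDOW`, the helper names and the example records are all hypothetical and only stand in for the real API results.

```python
from datetime import date

# Hypothetical fixed search window; the actual years would be chosen by the authors.
SEARCH_WINDOW = (2020, 2024)


def rolling_window(today: date, years: int = 5) -> tuple[int, int]:
    """Window relative to the run date; it shifts when the script is rerun in a later year."""
    return (today.year - years, today.year)


def filter_by_year(records: list[dict], window: tuple[int, int]) -> list[dict]:
    """Apply the same inclusive year filter to results from any source."""
    start, end = window
    return [r for r in records if start <= r["year"] <= end]


# Hypothetical records standing in for results returned by different sources.
records = [
    {"source": "dblp", "title": "A", "year": 2020},
    {"source": "ieee", "title": "B", "year": 2021},
    {"source": "scholar", "title": "C", "year": 2025},
]

# A fixed window yields the same selection no matter when the script runs.
fixed_hits = filter_by_year(records, SEARCH_WINDOW)

# A rolling window yields different selections for different run dates.
hits_2025 = filter_by_year(records, rolling_window(date(2025, 6, 1)))
hits_2026 = filter_by_year(records, rolling_window(date(2026, 6, 1)))
```

With a fixed window, the selection depends only on the data, whereas the rolling window already returns a different set of records when the run date moves from 2025 to 2026. Publishing the retrieved result list alongside the script would remove the remaining dependence on changes in the search engines themselves.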
# Comments
* SQL and SPARQL require an LLM to use some knowledge about the database / KG schema, but why is this specific use case discussed in such detail? Additionally, for the use case of knowledge integration, SQL and SPARQL pose somewhat similar problems; it is not necessary to discuss both in detail, in particular without referring to the specific challenges of each.
* "There are survey papers that [...]" page 3, line 34. These papers are not cited as part of this sentence. It is better to discuss a specific paper and cite it, rather than merely informing the reader that it exists. Additionally, these refer to other survey papers on knowledge integration; it would be beneficial to more clearly distinguish the focus of these papers from the submitted survey.
* The methodology of the literature review is not comprehensive. It is not discussed how the list of search phrases was compiled and why it should be considered well-defined. The list itself contains keywords like "LLMs as Zero-Shot Learners", "Addressing Hidden Risks in LLMs" and "LLMs in Manufacturing", which seem to be quite specific but are not obviously related to LLMs and external knowledge. It is not stated how many papers were considered, nor are any other statistics about the search provided. Snowballing could have been used to make the literature review more comprehensive. In its current state, a recognisable methodology for this survey is not apparent.
* Sections 2 and 3, as motivation for knowledge injection, are too long. The motivation could be included in the introduction in a condensed form. An overview of NLU systems and a lengthy discussion of LLM challenges may not be the most suitable approach for this survey paper. The first methods for knowledge integration are discussed on page 8 of 16.
* "LLMs may generate logically corrrect strings that appear valid" page 5 line 45. What definition of logically correct is applied in this paragraph? If something is logically correct, i.e. it logically follows from a set of premises, it hardly qualifies as a hallucination. If the model instead reasons from different premises, its output is not a logically correct solution to the actual task either, since that task is to reason from the given, correct premises.
* "As a result, identical inputs do not always yield the same output when prompted repeatedly." page 7 line 8-9 and "LLMs do[es] not guarantee deterministic results" page 8 line 17. This is wrong! At the core, LLMs are deterministic, i.e. once the weights are trained, they will produce the same output given the same input. Many LLM applications introduce non-deterministic sampling as a feature, but parameters like temperature or top-p can be used to configure this. It does not make sense to claim this as an issue of LLMs. Repeatedly making wrong claims about the fundamental architecture of LLMs is not acceptable for a good survey paper.
* Sections 4.1.1, 4.1.2 and 4.1.3 all describe more or less the same PEFT concepts. In all sections, it is mentioned that only a subset of the parameters of an LLM needs to be fine-tuned. What is the difference between these methods? When does it make sense to use which approach?
* In 4.1.1 knowledge extraction with LLMs is discussed at length (again), but it is unclear how this relates to pre-training, or even why it is a relevant part of the survey in general.
* The relation between PEFT, LoRA and Wang et al. [103] does not make sense. PEFT is a category of approaches, and LoRA (and [103]) are methods of this category. The authors could restructure this paragraph to highlight this relation between the methods.
* If steering can be done in a parametric and non-parametric setting, why is it part of the non-parametric section of this work?
* In section 4.2.1, LLM-augmented KGs are discussed, which do not quite fit the topic of knowledge integration in LLMs. These methods deal with the extraction of knowledge (again).
* Constrained decoding in LLMs is not able to apply ontological rules. Constrained decoding operates on the syntax level, i.e., it can ensure that the generated tokens adhere to a specific syntax, but ontological rules are more complex.
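To make the comment on determinism above concrete, here is a minimal sketch using a toy next-token distribution rather than a real model. The logits and function names are hypothetical, and the sketch assumes fixed weights (i.e. fixed logits for a given input) and ignores hardware-level numerical effects. It shows that any variation in the output comes from the decoding strategy, not from the model itself.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(logits):
    """Pick the highest-scoring token; no randomness is involved."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_decode(logits, temperature, rng):
    """Draw a token from the temperature-scaled distribution using the given RNG."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token scores that a fixed model assigns to one fixed input.
logits = [2.0, 1.5, 0.3, -1.0]

# Greedy decoding: the same input gives the same token on every call.
greedy_runs = {greedy_decode(logits) for _ in range(100)}

# Sampling with a fixed seed: still the same token on every call.
seeded_runs = {sample_decode(logits, 1.0, random.Random(0)) for _ in range(100)}

# Sampling with varying seeds: different tokens can be drawn for the same input.
sampled_runs = {sample_decode(logits, 1.0, random.Random(seed)) for seed in range(100)}
```

In this sketch, `greedy_runs` and `seeded_runs` each contain a single token index, while `sampled_runs` contains several. The variation is therefore a configurable property of the sampling step (temperature, top-p, seed) and not an inherent property of the trained model.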
## Minor comments
* "[LLMs] have proven to be good in various [...]" page 1 line 20. Proving something is a strong statement, particularly in a scientific paper. So far, we have empirical evidence that LLMs are good at NLU tasks.
* "Unlike scientific expert analyses, LLMs struggle [...]" page 2 line 23. This sentence does not make sense. It should be rephrased to be understandable.
* "eliminating the need for recursion or convolution" page 5 line 3. These techniques are still frequently applied even for NLP and NLU tasks. It is better to phrase this differently.
* "The attention mechanism understand [...]" page 5 line 4 and "So as to educate the model [...]" page 9 line 26. The attention mechanism does not understand. Anthropomorphising LLMs is not helpful if we want to understand what these models are actually doing. While I acknowledge that many researchers commonly do this, it is not good style, and the survey would be better if the authors avoided it.
# Score
This work should not be published in its current form. It contains several factual mistakes, and the survey itself is far from a comprehensive, well-structured survey about knowledge integration in LLMs. I recommend that the authors focus their work on specific aspects of knowledge integration instead. A study about the whole area might be too ambitious.