External Knowledge Integration in Large Language Models: A Survey on Methods, Challenges, and Future Directions

Tracking #: 3835-5049

Authors: 
Itisha Yadav
Sirko Schindler
Diana Peters
Roman Klinger

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Survey Article
Abstract: 
Large language models (LLMs) have proved to be good in various natural language understanding (NLU) tasks. However, they face notable limitations like hallucinations, lack of contextual knowledge, and outdated or incomplete knowledge when applied across knowledge-intensive domains such as scientific research, biomedical sciences, finance, law, and others. These challenges commonly arise due to the scarcity and under-representation of domain-specific data during the training and model alignment phases, the latter being synonymous with reinforcement learning from human feedback (RLHF). Furthermore, LLMs struggle with providing nuanced expertise, as their internal knowledge remains static and generalized, hindering their ability to reason accurately or deliver context-aware results in specialized tasks. This survey investigates the integration of external knowledge into LLMs as a means to address these limitations. By investigating parametric and non-parametric approaches, this work discusses methods to enhance model reasoning capabilities, factual accuracy, and adaptability for domain-specific and knowledge-intensive tasks. Additionally, it highlights the potential of external knowledge integration in improving explainability and ensuring more trustworthy outputs. This survey supports software developers and natural language processing (NLP) researchers in designing natural language understanding systems for specialized domains by leveraging pre-trained LLMs. Additionally, the work provides a foundation for advancing LLM-based NLU systems with insights into future research areas.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Célian Ringwald submitted on 30/Jun/2025
Suggestion:
Minor Revision
Review Comment:

I recommend accepting this survey article, which is both timely and highly relevant to the field. It is a well-written and accessible overview that effectively synthesises the current challenges and limitations associated with knowledge-intensive tasks.

Nonetheless, I have some questions and comments to address to the authors before discussing the final decision.

(1) Intended Audience and Relevance:
This work clearly defines its intended audience. It is entirely suitable as an introductory text or for quickly becoming familiar with the current state of the art related to knowledge-intensive tasks.

(2) Coverage and Balance:

The paper is based on a clearly defined taxonomy of solutions proposed by the literature for integrating external knowledge into NLU systems. It systematically differentiates between parametric and non-parametric integration approaches.
- Strengths:
* The structure and the flow of the article are pleasant to read.
* The paper offers a high-level comparison of several very recent contributions, which is both refreshing and valuable.
- Areas for Improvement:
* The discussion of limitations is more detailed for parametric methods than for non-parametric ones. For instance, the "Constraint-Decoding" and "RAG" sections (pp. 11–12) would benefit from a more explicit analysis of their respective limitations.
* It was surprising that the discussion of RAG methods is the only place where concrete performance measures are mentioned; no results are provided to evaluate the previously described methods.

(3) Clarity and Readability

Q3-1. When reading the abstract, the keyword RLHF drew my attention; however, this notion is not discussed in the article's content. To address your audience, you need to clarify or delete this sentence.

Q3-2. From p. 5 onwards, I had the feeling that the analysis focuses only on decoder-based models. It would be helpful to state this explicitly earlier in the text to clarify the scope.

Q3-3. This may also be a detail, but surveys mostly rely on primary research. I have the impression that most of the discussed statements are derived from other surveys, so I am unsure whether this paper should be considered a tertiary study (as defined in https://legacyfileshare.elsevier.com/promis_misc/525444systematicreviews...). A short clarification of the article’s positioning in this regard would be appreciated.

Q3-4. Although the work does not claim to follow a rigorous systematic review protocol, the availability of the scripts used for literature collection is valuable. I think some key figures could be reported in the "Scope and literature survey Methodology" paragraph (number of papers taken into account; number of surveys, articles, and preprints).

(4) Relevance to the Semantic Web Community

From my perspective, the article is indeed relevant to the Semantic Web community, even though it does not explicitly cover core Semantic Web technologies such as RDF, OWL, or SHACL. It offers a macro-level analysis of the current challenges and opportunities that LLMs bring to knowledge-intensive tasks such as information extraction, question answering (as with SPARQL), and knowledge graph construction and population.

(5) Shared resources

Concerning the long-term stable URL, I appreciate having access to the script. However, it would be better to also share the "recent_papers_comprehensive.csv" file obtained from this process. You chose not to follow a systematic review process, so there is no shame in sharing a distilled and/or incomplete list of references.

(6) Stylistic and Minor Corrections
There are several minor typographical and grammatical issues that should be addressed, for example, adding commas after adverbs.
Below are some typos I spotted:
* p3-l.40: perforing > performing
* p6-l.37: "LLMs struggle with processing or generating up-to-date information for NLU task" > tasks
* p9-l.7: “Where was Abraham Lincoln born?” but fail to > fails
* p13-l.11: "elastic search [22], chromaDB.. > Elastic Search + ChromaDB ?
* p14-l.25: Each of these techniques come with their advantages and disadvantages and therefore require future > Each of these techniques comes with its advantages and disadvantages and therefore requires

(7) Community Connection
It is worth noting that this survey aligns well with the themes explored in the GOBLIN E-COST initiative. Specifically, Task 3.2 of the GOBLIN project addresses both retrieval-augmented generation (RAG) and knowledge injection, mirroring the parametric and non-parametric dichotomy discussed in this paper. Sharing the constructed corpus and engaging with these community discussions would not only enhance the impact of this work but also help position you for future work.

Review #2
By George Hannah submitted on 29/Jul/2025
Suggestion:
Accept
Review Comment:

The authors provide an overview of current limitations present in Large Language Models (LLMs) as well as surveying a wide range of differing methodologies to address these limitations.

With the exception of the occasional minor typo, the paper is well written and easy to understand. The introduction clearly identifies and frames the limitations of LLMs in a way that can be understood by readers of varying backgrounds. For some methodologies discussed, the approach is described but not explained further. In these cases, a small expansion into how the method works would be beneficial for a reader. Additionally, in section 6, a comparative analysis of the approaches discussed would allow the paper to draw a clearer conclusion, and allow other researchers to more easily identify areas where a novel method can address current limitations. Figure 2 partially addresses this, but comparing the approaches in more dimensions than just hardware requirements and explainability would be preferable. The overview provided by the authors is fair and balanced as they clearly discuss the benefits and limitations of the methods surveyed.

When discussing hallucinations in section 3.1, an example is provided; however, without the prompt that led to that response, it is unclear why it is a hallucination. The need for the prompt in this case is backed up in the paper, where the point is made that hallucinations are not necessarily negative depending on the domain and use case of the LLM.

The authors make use of a Python script to retrieve papers for this survey based on a set of relevant keywords. (For closed repositories, the same process has been carried out by hand.) The code has been provided alongside the paper as supplementary material. Whilst no README has been provided, the code is clear and well formatted, so the purpose of each part of the code can be easily understood. A limitation of this method of collecting papers for the survey is that it relies on the assumption that the set of keywords used is exhaustive. In practice, it is unlikely that this assumption is true, meaning that there are other state-of-the-art approaches that have not been identified by this survey. However, I believe that the authors have done a good job in covering a significant portion of the area.

Overall I believe that this paper is well written, balanced, fairly comprehensive, and falls well within the scope of the call, so I recommend that this paper is accepted by the editors.

Review #3
Anonymous submitted on 09/Sep/2025
Suggestion:
Reject
Review Comment:

# Summary
This survey summarises research papers related to the integration of external knowledge into LLMs. The surveyed papers are retrieved based on a keyword search, without providing a detailed description of the methodology. The authors motivate the need for knowledge injection by focusing on hallucinations, outdated knowledge and domain-specific knowledge. The identified work is divided into parametric and non-parametric adaptations of LLMs. For parametric approaches, parameter-efficient fine-tuning (PEFT) methods are discussed. Additional methods are mentioned, but not clearly distinguished from PEFT. Non-parametric approaches are limited to brief descriptions of KG integration and retrieval-augmented generation methods.

# Strengths
A comprehensive survey on the broad topic of knowledge integration into LLMs is highly relevant. A good introductory text that structures and organises recent work, which is abundant in this area, would be valuable.  

# Weaknesses
This survey is not comprehensive and overlooks several key areas of integrating external knowledge, such as tool calling, reinforcement learning, or even simple prompting. It relies only on a keyword search based on a "well-defined taxonomy", i.e., a random selection of keywords that look interesting (at least based on a peek at the actual keywords; the paper does not provide details on how they were selected). Additionally, this work contains several factual mistakes, is poorly structured (and as a result, repeats itself constantly), and is overall not well-written.

# Review dimensions
* The work is not suitable as an introductory text due to its lack of clear structure, factual inaccuracies, frequent repetitions and overall poor language. For the intended audience of practitioners or developers, this survey fails to provide clear insights to guide the implementation of specific systems. For a research-based audience, the provided descriptions of particular methods are too generic.        
* The survey misses several well-known methods for knowledge integration of LLMs: tool usage/function calling, knowledge distillation, mixture-of-experts models, any form of reinforcement learning, memory systems, and even simple prompting are not discussed. The discussion of specific methods is limited to a high-level description without any discussion of related experiments, and the comparison between methods is not based on empirical data.
* Many repetitions, grammatical errors and a lack of focus on the discussed topic make this paper particularly difficult to read. The provided figures do not improve the clarity.
* The basic topic of knowledge integration is relevant for the NLP and broader AI community. In particular, the brief discussion of non-parametric integration of KGs is relevant to the broader Semantic Web community. In general, a comprehensive survey would be of great value. However, this paper only covers a small amount of work relevant to this topic.
* The provided resource is a single Python script, which uses the APIs of several repositories and search engines for scientific literature to search for specific predefined terms. The results of this search are not provided, and it is unlikely that the script can reproduce the same results used for the survey. The script does not use a fixed date range for the search (it only considers the last 5 years, i.e. the search results are already different next year). Additionally, the date filter is only partially applied. For DBLP and Google Scholar, it is ignored, and the year 2021 is hard-coded for IEEE Xplore. It is published on Zenodo, but does not contain a README, and no data is provided.

# Comments
* SQL and SPARQL require an LLM to use some knowledge about the database / KG schema, but why is this specific use case discussed in such detail? Additionally, for the use case of knowledge integration, SQL and SPARQL pose somewhat similar problems; it is not necessary to discuss both in detail (without referring to the specific challenges of each).
* "There are survey papers that [...]" page 3, line 34. These papers are not cited as part of this sentence. It is better to discuss a specific paper and cite it, rather than merely informing the reader that it exists. Additionally, these refer to other survey papers on knowledge integration; it would be beneficial to more clearly distinguish the focus of these papers from the submitted survey.  
* The methodology of the literature review is not comprehensive. It is not discussed how the list of search phrases is compiled and why it should be well-defined. The list itself contains keywords like "LLMs as Zero-Shot Learners", "Addressing Hidden Risks in LLMs" and "LLMs in Manufacturing", which seem to be quite specific but are not obviously related to LLMs and external knowledge. It is not stated how many papers are considered, nor are any other statistics provided based on this method. The literature review could have used snowballing to achieve a more comprehensive literature review. In its current state, a recognisable methodology for this survey is not apparent.
* Sections 2 and 3, as motivation for knowledge injection, are too long. The motivation could be included in the introduction in a condensed form. An overview of NLU systems and a lengthy discussion of LLM challenges may not be the most suitable approach for this survey paper. The first methods for knowledge integration are discussed on page 8 of 16.  
* "LLMs may generate logically corrrect strings that appear valid" page 5 line 45. What definition of logically correct is applied in this paragraph? If something is logically correct, i.e. it logically follows from a set of premises, it hardly qualifies as a hallucination. If the model considers different premises, the actual task of logically reasoning with the given, correct premises would not be logically correct either.
* "As a result, identical inputs do not always yield the same output when prompted repeatedly." page 7 line 8-9 and "LLMs do[es] not guarantee deterministic results" page 8 line 17. This is wrong! At the core, LLMs are deterministic, i.e. once the weights are trained, they will produce the same output given the same input. Many LLM applications introduce non-deterministic sampling as a feature, but parameters like temperature or top-p can be used to configure this. It does not make sense to claim this as an issue of LLMs. Repeatedly making wrong claims about the fundamental architecture of LLMs is not acceptable for a good survey paper.  
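
The distinction drawn in this comment can be illustrated with a minimal sketch. It assumes a toy four-token vocabulary with fixed logits standing in for a trained model; `softmax`, `greedy_pick`, and `sample_pick` are hypothetical helper names chosen for illustration and are not taken from the submission or from any library. Under these assumptions, greedy selection always returns the same token, while temperature sampling varies only because of the random number generator, and becomes repeatable again once the seed is fixed.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Convert logits to probabilities; lower temperature sharpens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    # Deterministic: always the index of the largest logit.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    # Stochastic: draws an index from the temperature-scaled distribution.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Fixed logits play the role of a trained model's output for one input (assumption).
logits = [2.0, 1.5, 0.3, -1.0]

greedy_runs = {greedy_pick(logits) for _ in range(100)}
sampled_runs = {sample_pick(logits, 1.0, random.Random(seed)) for seed in range(100)}
seeded_runs = {sample_pick(logits, 1.0, random.Random(0)) for _ in range(100)}

print("greedy:", greedy_runs)
print("sampled, different seeds:", sampled_runs)
print("sampled, same seed:", seeded_runs)
```

In this toy setting the variation comes entirely from the sampling step, which is a configurable decoding choice and not a property of the fixed logits.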

* Sections 4.1.1, 4.1.2 and 4.1.3 all describe more or less the same concepts of PEFT. In all sections, it is mentioned that only a subset of parameters of LLMs needs to be fine-tuned. What is the difference between these methods? When does it make sense to use which approach?
* In 4.1.1, knowledge extraction with LLMs is discussed at length (again), but it is unclear how this should be related to pre-training or even why it is a relevant part of the survey in general.
* The relation between PEFT, LoRA and Wang et al. [103] does not make sense. PEFT is a category of approaches, and LoRA (and [103]) are methods of this category. The authors could restructure this paragraph to highlight this relation between the methods.
* If steering can be done in a parametric and non-parametric setting, why is it part of the non-parametric section of this work?  
* In section 4.2.1, LLM-augmented KGs are discussed, which do not quite fit the topic of knowledge integration in LLMs. These methods deal with the extraction of knowledge (again).
* Constraint-Decoding in LLMs is not able to apply ontological rules. Constraint-decoding operates on the syntax level, i.e., it can ensure that the generated tokens adhere to a specific syntax, but ontological rules are more complex.
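
The syntax-level nature of constrained decoding can be sketched as follows. This is a toy illustration under stated assumptions, not the method of any surveyed paper: the vocabulary `VOCAB`, the finite-state grammar `ALLOWED`, and the fixed scores in `SCORES` (standing in for model logits) are all hypothetical. At each step, tokens the grammar does not permit are masked to negative infinity before the best token is chosen.

```python
import math

# Hypothetical toy vocabulary.
VOCAB = ["{", "}", '"name"', ":", '"Ada"', '"42"', "hello"]

# Hypothetical finite-state grammar for the pattern: { "name" : <string> }
# Maps state -> {allowed token: next state}.
ALLOWED = {
    0: {"{": 1},
    1: {'"name"': 2},
    2: {":": 3},
    3: {'"Ada"': 4, '"42"': 4},
    4: {"}": 5},
}
FINAL_STATE = 5

# Fixed scores standing in for model logits (assumption): the "model"
# prefers the syntactically invalid token "hello" at every step.
SCORES = {"hello": 5.0, '"42"': 2.0, '"Ada"': 1.0}

def fake_logits():
    return [SCORES.get(tok, 0.0) for tok in VOCAB]

def unconstrained_first_token():
    logits = fake_logits()
    return VOCAB[max(range(len(VOCAB)), key=lambda i: logits[i])]

def constrained_decode():
    state, output = 0, []
    while state != FINAL_STATE:
        allowed = ALLOWED[state]
        # Mask every token the grammar does not allow in the current state.
        masked = [score if tok in allowed else -math.inf
                  for tok, score in zip(VOCAB, fake_logits())]
        token = VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]
        output.append(token)
        state = allowed[token]
    return output

print(unconstrained_first_token())
print(" ".join(constrained_decode()))
```

The constrained output is well-formed, yet the grammar accepts `"42"` as a name just as readily as `"Ada"`: nothing in the token mask encodes a rule about what a valid name is, which is the gap between syntactic constraints and ontological rules that the comment points to.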

## Minor comments
* "[LLMs] have proven to be good in various [...]" page 1 line 20. Proving something is a strong statement, particularly in a scientific paper. So far, we have empirical evidence that LLMs are good at NLU tasks.  
* "Unlike scientific expert analyses, LLMs struggle [...]" page 2 line 23. This sentence does not make sense. It should be rephrased to be understandable.
* "eliminating the need for recursion or convolution" page 5 line 3. These techniques are still frequently applied even for NLP and NLU tasks. It is better to phrase this differently.  
* "The attention mechanism understand [...]" page 5 line 4 and "So as to educate the model [...]" page 9 line 26. The attention mechanism does not understand. Anthropomorphising LLMs is not helpful if we want to understand what these models are actually doing. While I acknowledge that many researchers commonly do this, it is not a good style, and the survey would be better if the authors avoided it.  

# Score
This work should not be published in its current form. It contains several factual mistakes, and the survey itself is far from a comprehensive, well-structured survey about knowledge integration in LLMs. I recommend that the authors focus their work on specific aspects of knowledge integration instead. A study about the whole area might be too ambitious.