A new retrieval-augmented generation (RAG) approach for querying and constructing large-scale knowledge graphs

Tracking #: 3854-5068

Authors: 
Nilay Tufek Ozkaya
Burak Yigit Uslu
Valentin Philipp Just
Tathagata Bandyopadhyay
Aparna Saisree Thuluva
Marta Sabou
Allan Hanbury

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Large Language Models (LLMs) have demonstrated remarkable capabilities in extracting knowledge from and generating new content based on various types of resources, particularly text-based ones. Besides unstructured data, LLMs have also shown promising results when leveraging structured but semantically complex resources such as ontologies, schemas, and knowledge graphs. However, the practical use of large-scale semantic artifacts as direct input to LLMs is constrained by prompt size and token limitations. To address this issue, it is necessary to employ Retrieval-Augmented Generation (RAG) systems to preprocess and segment these large resources effectively. In this paper, we propose a novel RAG-based architecture, which includes LLM-based Named Entity Recognition and Disambiguation (NERD) and Entity Linking (EL) solutions, tailored for large-scale semantic artifacts, using OPC UA information models—an industrial standard—as a foundation. Within this framework, we implement and evaluate three distinct use cases that combine LLMs with the proposed RAG system: (i) semantic artifact validation, (ii) information retrieval, and (iii) information model generation. Each use case demonstrates strong performance, achieving F1-scores of up to 100%, thereby validating the effectiveness of the approach. Furthermore, we evaluate the generalizability of the system across two different domains, confirming its robustness and applicability in diverse industrial contexts.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Birgitta Koenig-Ries submitted on 02/Jun/2025
Suggestion:
Minor Revision
Review Comment:

So here are my comments on the paper:

I found the paper a good, interesting read. It proposes an innovative approach to combining LLMs and KGs and offers a thorough evaluation.

I could not find information about the availability of data, code, or evaluation results. Without this, a final decision about the paper is impossible for me. Thus, please provide access to these artefacts (for reviewers, but also for the readers of the paper).

Beyond that, I do have a few rather minor suggestions for improvement:

p.2, l.7: Can you describe how the token limits compare to KG sizes (i.e., how many tokens would be needed for your entire KG)?
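
For example, this could be estimated directly by tokenizing the serialized graph. A minimal sketch, assuming a Turtle serialization of the KG (the file name is illustrative):

```python
import tiktoken           # pip install tiktoken
from rdflib import Graph  # pip install rdflib

# Load the KG and serialize it as Turtle text.
g = Graph()
g.parse("robotics_kg.ttl", format="turtle")  # illustrative file name
text = g.serialize(format="turtle")

# Count tokens with the o200k_base encoding used by GPT-4o-class models.
enc = tiktoken.get_encoding("o200k_base")
print(f"{len(g)} triples -> {len(enc.encode(text))} tokens")
```

Setting this count against a 128k-token context window would make the constraint concrete.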

p.2, l.34: not sure GraphQL is the most relevant reference here. What about, e.g., Cypher?

p.3, l.1: I don't see a clear difference between RQs 1 and 2. Provide more detail or merge.

p.4, l.13: What about the SPARQL extension of SHACL?
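
For reference, SHACL-SPARQL lets validation rules be expressed as SPARQL queries. A minimal sketch with pyshacl; the class and property names are illustrative, not taken from the paper:

```python
from rdflib import Graph
from pyshacl import validate  # pip install pyshacl

# SHACL-SPARQL constraint (illustrative): every ex:Component must
# declare a browse name. The full IRI is used inside the query because
# Turtle prefixes are not visible to sh:select without sh:prefixes.
shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:ComponentShape a sh:NodeShape ;
    sh:targetClass ex:Component ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Component is missing a browse name." ;
        sh:select \"\"\"
            SELECT $this WHERE {
                FILTER NOT EXISTS {
                    $this <http://example.org/browseName> ?name .
                }
            }
        \"\"\" ;
    ] .
"""

data_ttl = """
@prefix ex: <http://example.org/> .
ex:Axis1 a ex:Component .   # no browse name -> violation
"""

shapes = Graph().parse(data=shapes_ttl, format="turtle")
data = Graph().parse(data=data_ttl, format="turtle")
conforms, _, report = validate(data, shacl_graph=shapes, advanced=True)
print(conforms)   # False
print(report)
```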

p.4, l.37: is multi-modality exploited in any way, or do you focus on the text?

p.5, Fig. 1: Talking about master-slave is somewhat outdated and contested. I suggest changing the example.

p.7, l.26ff: Do you have evidence for that claim?

p.9, l.15: typo in the figure ("informaiton").

p.9, l.38 ("The label..."): I don't understand that sentence.

p.10, l.26: how do you prune?

p.12, Algorithm 1: The algorithm assumes that the shortest path represents the connection between the entities the user is interested in. That is not necessarily the case (in particular if user queries are not very verbose and it is difficult to guess the intent of the users). Can you deal with that?
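
One possible mitigation (an illustrative sketch, not the paper's algorithm): instead of committing to the single shortest path, union the k shortest simple paths, so near-shortest alternative connections survive into the subgraph:

```python
import itertools
import networkx as nx

def connecting_subgraph(G: nx.Graph, source, target, k: int = 3) -> nx.Graph:
    """Union of the k shortest simple paths between two linked entities.

    Hedges against the single shortest path missing the user's intent:
    near-shortest alternatives are retained as well.
    """
    nodes = set()
    # shortest_simple_paths yields paths in order of increasing length.
    for path in itertools.islice(nx.shortest_simple_paths(G, source, target), k):
        nodes.update(path)
    return G.subgraph(nodes)

# Toy example with two equally plausible connections.
G = nx.Graph()
G.add_edges_from([("Robot", "Axis"), ("Axis", "Motor"),
                  ("Robot", "Controller"), ("Controller", "Motor")])
print(sorted(connecting_subgraph(G, "Robot", "Motor", k=2).nodes))
```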

p.12, l.38: how do you prioritize?

p.16, l.2: where can I find the rules?

p.16, Table 5: in addition to the overall results, it would be interesting to see metrics for the individual questions.

Comment to the editors: I did not find the "Long-term stable URL for resources" - but maybe I simply didn't know where to look.

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Review #2
Anonymous submitted on 28/Aug/2025
Suggestion:
Major Revision
Review Comment:

This paper proposes a framework for retrieval-augmented generation based on large-scale knowledge graphs, integrating components such as named entity recognition, entity linking, and GPT-x models. An evaluation is provided using multiple LLMs, comparing their performance on a given use case in automation.

In this work the authors chose OPC UA as the underlying information standard. While this standard is used in industrial environments, the general approach of RAG over large-scale knowledge graphs clearly extends beyond this domain and could potentially be used in entirely different contexts. Therefore, the motivation for using OPC UA is not very clear. It is presented as simply the use case on which the paper chose to focus, yet the approach has a generalizable nature that is not exploited. In this sense, it would be preferable to move OPC UA into the background and limit it to the use case; otherwise the reader may be led to think that the approach is, for some reason, strongly tied to that particular standard. This mismatch is particularly noticeable in the research questions: none of them mentions automation or industry standards. The research questions focus on the core of the paper's contribution, which is the RAG approach over large KGs.
Apart from this point, the paper describes the proposed approach in considerable detail, giving a prominent place to the so-called retrieval step, which includes the subgraph extraction process. This step is essential for the augmentation, as portions of the original large KG are extracted. Although the explanation is in principle easy to follow, the authors have not spent much effort on a solid formalization of the approach. This limitation recurs throughout the paper: various simplified explanations are given, along with some examples, but at no point do we get a clear formalization of the research problem or of the proposed approach. It would be important to work on these aspects, which would provide more solid foundations for the proposed techniques and strengthen the novelty claims of the paper.
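
As a starting point, such a formalization might look as follows (one possible sketch, not taken from the paper):

```latex
% One possible formalization of the retrieval step (illustrative).
% Given a KG G = (V, E), a natural-language query q, a relevance
% measure rel, and a token budget B imposed by the LLM context window,
% retrieval selects the most relevant subgraph that fits the budget:
\[
  G_q^{\ast} \;=\; \mathop{\mathrm{arg\,max}}_{G' \subseteq G} \ \mathrm{rel}(G', q)
  \qquad \mathrm{s.t.} \qquad \mathrm{tokens}(G') \le B ,
\]
% where tokens(G') is the token length of the serialization that is
% ultimately passed to the LLM.
```
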
Another important discussion point relates to the use cases and experiments. For instance, the validation use case appears to be a rather simple task that is not especially challenging. The evaluation at the end simply shows the effectiveness of GPT-4o over GPT-4, which says very little about the approach itself and more about the strengths of one LLM version over another. The authors may need to think about better ways to showcase the contributions of their proposed approach.
Similarly, for information retrieval the experiments simply show that, for the selected NLQs, the system essentially generates the expected queries. These results are hard to interpret, as it looks as though the task was easy enough for the model, and it is not clear what the significant advance over the state of the art is. It would be advisable to reformulate the experiments so that the advantages of the proposed approach are highlighted over existing alternatives, or so that the tasks are shown to be challenging enough that previous attempts with baselines exhibit clear shortcomings.
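
For concreteness, the evaluated task amounts to pairs of the following kind (a hypothetical example with illustrative names, not taken from the paper):

```python
# Hypothetical NLQ and the SPARQL query the system would be expected
# to generate for it (entity and property names are illustrative).
nlq = "Which components does the motion device MyRobot consist of?"

expected_sparql = """
PREFIX ex: <http://example.org/opcua/>
SELECT ?component WHERE {
    ex:MyRobot ex:hasComponent ?component .
}
"""
```

Pairs of this shape are easy for current models; a stronger evaluation would include NLQs whose expected queries require, for example, multi-hop joins or aggregation, where baselines show clear shortcomings.
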
A more diverse set of data models would also be expected; otherwise there is a risk of over-specializing the approach.
Finally, it would be important to better explore the cases in which this generation approach fails, for example in the information retrieval experiment. The appendix contains some failure cases that perhaps require further study. Why are some of these variations more problematic? It appears that entity matching starts to fail when variations are less subtle. In this context, a more complete evaluation, integrating the parts currently shown in the appendices, might need to be consolidated into the main paper. In general, I would suggest a more comprehensive and ambitious evaluation to strengthen the claims of this work.

Review #3
By Antonello Meloni submitted on 25/Sep/2025
Suggestion:
Minor Revision
Review Comment:

Overall assessment:
This manuscript presents a compelling and well-structured RAG-based framework for integrating LLMs with large-scale knowledge graphs. The approach is innovative, combining NERD, entity linking, graph modularization, KG validation, information retrieval, and semantic artifact generation. The evaluation on OPC UA Robotics and PackML KGs demonstrates strong performance, with F1-scores up to 100%, and the writing is clear and accessible. Overall, the paper makes a significant contribution to the field of knowledge graph-based information retrieval and generation.

Specific comments and minor concerns:

Token limit during retrieval and subgraph extraction:

In the retrieval phase, the system constructs a prompt containing all entities from the knowledge graph to match against the NLQ. While this is effective for the evaluated KGs, token limits could conceivably constrain the matching process in substantially larger graphs and potentially affect downstream SPARQL query generation.
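
One standard mitigation would be to batch the entity labels so that each matching prompt fits the budget. A minimal sketch, where `llm_match` is a hypothetical stand-in for the actual model call:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")

def llm_match(nlq: str, batch: list[str]) -> set[str]:
    """Hypothetical stand-in for the LLM matching call; a trivial
    lexical match is used here so the sketch runs end to end."""
    return {label for label in batch if label.lower() in nlq.lower()}

def batch_entities(labels: list[str], budget: int) -> list[list[str]]:
    """Split entity labels into batches whose token counts fit the budget."""
    batches, current, used = [], [], 0
    for label in labels:
        n = len(enc.encode(label)) + 1  # +1 for a separator token
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(label)
        used += n
    if current:
        batches.append(current)
    return batches

def match_nlq(nlq: str, labels: list[str], budget: int = 6000) -> set[str]:
    """Match the NLQ against each batch separately, then merge candidates."""
    matches: set[str] = set()
    for batch in batch_entities(labels, budget):
        matches |= llm_match(nlq, batch)
    return matches
```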

During subgraph extraction and expansion, symbolic rules are applied to include intermediate nodes, class instances, and related entities. The paper notes that the token limit may prevent full expansion, but it is not clear how priorities are determined or which nodes may be omitted. Clarification on these mechanisms would help readers understand potential trade-offs in completeness and robustness.
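
For illustration, one plausible mechanism (the paper does not specify its priorities, so the scoring here is an assumption) is best-first expansion under a token budget, where lower-scored nodes are the ones omitted when the budget runs out:

```python
import heapq
import networkx as nx

def expand_subgraph(G: nx.Graph, seeds: set, score, cost, budget: int) -> set:
    """Best-first subgraph expansion under a token budget (illustrative).

    score(node) ranks candidates (e.g. class instances above plain
    literals); cost(node) estimates the tokens its serialization adds.
    Higher-priority nodes are admitted first; the rest are omitted.
    """
    selected = set(seeds)
    spent = sum(cost(n) for n in seeds)
    # Max-heap via negated scores, seeded with the neighbors of the seeds.
    frontier = [(-score(n), n) for s in seeds for n in G.neighbors(s)
                if n not in selected]
    heapq.heapify(frontier)
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in selected or spent + cost(node) > budget:
            continue  # already taken, or over budget: this node is omitted
        selected.add(node)
        spent += cost(node)
        for nb in G.neighbors(node):
            if nb not in selected:
                heapq.heappush(frontier, (-score(nb), nb))
    return selected
```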

Repository and data artifacts:

The GitHub repository contains only two Excel files with the prepared NLQ sets. While these files are well structured, the repository does not include any other materials necessary to fully replicate the experiments. Providing these artifacts would greatly enhance reproducibility.

The repository choice (GitHub) is appropriate for discoverability and long-term access, but the completeness of data artifacts is currently limited.

Conclusion:
The manuscript is convincing and makes a strong contribution to the field. The concerns raised are relatively minor and mostly concern scalability considerations and completeness of the supporting data. Addressing these points would further strengthen the paper and support its reproducibility.