Reranking Answers of Large Language Models with Knowledge Graphs

Tracking #: 3798-5012

Authors: 
Mikhail Salnikov
Hai Le
Olga Tsymboi
Ivan Lazichny
Dmitrii Iarosh
Egor Cheremiskin
Andrey Savchenko
Dmitry Simakov
Alexander Panchenko

Responsible editor: 
Guest Editors 2025 LLM GenAI KGs

Submission type: 
Full Paper
Abstract: 
Answering natural language questions over knowledge graph data is challenging due to the vast number of facts, which can be difficult to process and navigate. One potential solution is to use mined subgraphs related to the query, although these subgraphs still need to be extracted. This work presents a method for extracting subgraphs related to the entity candidates produced by a large language model for a given question, by computing the shortest paths between question and answer entities. The proposed approaches detail various features that can be extracted from these subgraphs, as well as reranking models that select the most probable answers from a list of candidates. Experiments on Wikidata evaluate the effectiveness of the proposed approaches, enumerating the main feature types that can be extracted from mined subgraphs and providing a detailed analysis of combinations of the proposed features and reranking methods. In addition, a public web application has been developed for studying the graph space between question and answer entities, including visualization of the extracted subgraph and automatic generation of natural language text describing it.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 20/Jun/2025
Suggestion:
Major Revision
Review Comment:

Summary:

Asking an LLM to answer a factoid question can lead to undesirable results. However, if one asks the LLM to output multiple different answers, the correct one is likely to be in the top N. This work designs a reranking approach for the LLM's answers based on subgraphs extracted from a knowledge base. Concretely, it proposes using subgraphs connecting the entities in the question to the candidate answer as input to the reranker. Accordingly, the work proposes a method for creating and pruning these subgraphs, and for extracting features from them. These features are either graph metrics or vectors obtained by embedding the output of graph-to-text methods. Different ranking strategies are assessed: semantic-based, regression-based, and neural-based.
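The subgraph construction the summary describes can be sketched as a shortest-path extraction between question entities and each candidate answer. The sketch below is illustrative only (the toy graph, entity names, and helper functions are hypothetical, not the paper's implementation), assuming an undirected knowledge graph given as an adjacency dict:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in an undirected KG given as an adjacency dict."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk predecessor links back to src
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj.get(node, ()):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None  # dst unreachable from src

def candidate_subgraph(adj, question_entities, candidate):
    """Union of shortest paths from each question entity to the candidate."""
    nodes = set()
    for q in question_entities:
        path = shortest_path(adj, q, candidate)
        if path:
            nodes.update(path)
    return nodes

# Toy graph: question entity -> intermediate node -> answer candidate.
adj = {
    "Q_douglas_adams": ["Q_hitchhikers_guide"],
    "Q_hitchhikers_guide": ["Q_douglas_adams", "Q_novel", "Q_42"],
    "Q_42": ["Q_hitchhikers_guide"],
    "Q_novel": ["Q_hitchhikers_guide"],
}
print(sorted(candidate_subgraph(adj, ["Q_douglas_adams"], "Q_42")))
```

Graph metrics computed over such a per-candidate node set (size, density, centralities) can then serve as reranker features.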
The paper is an extension of previous work. The new contributions of the submitted manuscript are the use and extraction of different sets of features and the comparison between them.

Comments:

The Related Work section has been structured as a Background section, as it contains definitions of Knowledge Graphs and Knowledge Graph Question Answering. Given that the work focuses on reranking LLMs' answers using subgraphs, the Related Work section lacks discussion of these particular topics. That is, it should be expanded with different reranking techniques and different subgraph generation methods, and should focus on the extraction of features from graphs.

One of the main contributions stated in the manuscript is the proposal of a novel approach for KGQA. However, the approach is presented as a combination of different steps, each with various implementations, lacking a final concise vision. Additionally, given that the relevant novel contribution is on the approach for generating and exploiting the subgraphs, other methods should have been assessed and compared in the experiments section.

The second main contribution is the comparison of the proposed reranker with well-known reranking techniques. However, the only comparison has been done against a simple semantic ranker. More advanced reranking techniques should be assessed in the experiments.

The methods and experiments sections include detailed discussions of certain components that, while informative, are not central contributions of the reranker and may benefit from a more focused presentation.

In short, I think that in its current state the manuscript is not suitable for publication, and I suggest a Major Revision. First of all, the Related Work section should cover other works that focus on reranking LLMs' answers and on the generation of subgraphs. Experiments with these other methods should be performed and discussed in the experiments section. Additionally, the methods and experiments sections could be strengthened by placing greater emphasis on the novel reranker contribution and streamlining the discussion of less critical parts.

Minor details:
Throughout the manuscript, there are instances of inconsistent or incorrect formatting when referencing figures, tables, algorithms, and appendices.

Review #2
Anonymous submitted on 25/Jul/2025
Suggestion:
Minor Revision
Review Comment:

In this paper, the authors propose their work on a re-ranking layer for Knowledge Graph Question Answering. More specifically, starting from the answer candidates list produced by a Language Model, the authors propose several re-ranking techniques based on three kinds of features: the graph’s structural features, text features, and graph-to-text features. The code, pre-trained models and detailed instructions are available for further research.

The paper is overall well-written and easy to follow. The structure is clear, as well as tables and figures, which are informative.
Content-wise, some areas need strengthening. I think the authors should expand the Related Work section, especially the part regarding KGQA, since a large number of papers and surveys are available on this topic.
Additionally, I would have included a section focusing specifically on current limitations and future work, as it could help better locate the authors’ work and contribution in this field.

As stated earlier, the authors have made their work available in a GitHub repository, which appears to be well-structured and can guide users and future researchers in continuing their work and reproducing the results. However, I would explicitly define the hardware requirements needed in appropriate sections of both the repository and the paper, so this information can be quickly accessed.

The approach appears to be sound, and it offers a novel synthesis, though the novelty stems more from the integration than from algorithmic invention.

I appreciate the ablation study and the thorough evaluation conducted by the authors, yet statistical significance tests are absent from the results. Consider adding them to complete the work.
Furthermore, I feel a comparison with other state-of-the-art systems would have been beneficial.
Since many real-world pipelines cannot rely on perfect entity linking, another important analysis would be to test the pipeline with off-the-shelf entity linkers. I understand that the authors wanted to avoid cascading errors; however, in real-world scenarios this is a crucial step that can significantly influence the decision to adopt the proposed method. Adding this analysis would help clarify how the system performs in practice.

Review #3
Anonymous submitted on 02/Oct/2025
Suggestion:
Major Revision
Review Comment:

The paper presents a reranking framework for Knowledge Graph Question Answering (KGQA) that identifies the shortest-path subgraphs in Wikidata connecting question entities to potential answer candidates. It extracts various features, including graph, text, and Graph2Text features, and trains rankers using logistic and linear regression, CatBoost, and an MPNet-based sequence ranker. This approach aims to improve the Hits@N metric over candidates generated by large language models on the Mintaka dataset. Additionally, the paper includes a web application for visualizing subgraphs and providing graph-to-text explanations. It also shares code and pointers to datasets and models, with the goal of enhancing the interpretability and reproducibility of the entire pipeline.

The paper is readable and mostly well-structured, but there are recurring typos and phrasing issues (e.g., “reprehensibility” vs “reproducibility,” inconsistent capitalization, minor grammatical errors) that should be corrected for clarity and polish in a journal version. Figures and tables are informative, although some legends and axis labels could be improved to better emphasize absolute and relative gains, thereby guiding readers to the key takeaways more quickly.
The pipeline consists of candidate generation through diverse beam search with fine-tuned large language models (LLMs), subgraph extraction from a local Wikidata dump with igraph, and feature engineering, which includes graph statistics, text concatenations, and Graph2Text variants. Rankers are trained and evaluated using Hits@N.
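The Hits@N metric used here counts the fraction of questions whose gold answer appears among the top-N reranked candidates. A minimal sketch, with hypothetical function and variable names:

```python
def hits_at_n(ranked_candidates, gold_answers, n):
    """Fraction of questions whose gold answer appears in the top-n candidates.

    ranked_candidates: list of per-question candidate lists, best first.
    gold_answers: list of sets of acceptable answers per question.
    """
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, gold_answers)
        if any(c in gold for c in cands[:n])
    )
    return hits / len(ranked_candidates)

# Toy evaluation: three questions, two candidates each.
ranked = [["Paris", "Lyon"], ["Berlin", "Munich"], ["Oslo", "Bergen"]]
gold = [{"Paris"}, {"Munich"}, {"Stockholm"}]
print(hits_at_n(ranked, gold, 1))  # only the first question hits at N=1
print(hits_at_n(ranked, gold, 2))  # the second question's gold answer is ranked 2nd
```

A reranker improves Hits@1 precisely when it moves gold answers like "Munich" above distractors like "Berlin".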
It is notable that text and Graph2Text signals are the most important features, with PageRank being a key graph feature. This suggests that textual data is more influential than structural features, underscoring the need for further studies on the role of subgraph structure in comparison to text alone, as well as testing on out-of-domain KGQA benchmarks.
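To make the PageRank observation concrete: over a mined subgraph, PageRank rewards candidates that sit at well-connected "hub" positions. A plain power-iteration sketch under assumed inputs (an adjacency dict; not the paper's igraph-based code):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed graph as {node: [neighbors]}."""
    nodes = set(adj) | {n for nbrs in adj.values() for n in nbrs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Teleportation term shared uniformly by all nodes.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, nbrs in adj.items():
            if not nbrs:
                continue
            share = damping * rank[src] / len(nbrs)
            for dst in nbrs:
                new[dst] += share
        # Dangling nodes (no outgoing edges): redistribute mass uniformly.
        dangling = sum(rank[n] for n in nodes if not adj.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

# Hub-like nodes accumulate rank: three nodes all point at "hub".
adj = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))  # the hub
```

Using such scores as ranker features lets graph structure break ties when text similarity alone is ambiguous.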

Strengths
+Clear and modular pipeline featuring subgraph mining, feature extraction, and multiple rankers, allowing for careful comparisons and analyses of feature importance across candidate LLMs.
+Consistent empirical progress on Mintaka, with thorough reporting across feature types and models, along with a practical web tool to visualize subgraphs and provide graph-to-text explanations.
+Insightful observations regarding feature importance, such as the contribution of PageRank and the prominence of text/Graph2Text signals, can guide future hybrid designs.

Weaknesses and required revisions
-Limited evaluation scope: primarily focused on Mintaka, excluding yes/no and count types. Add at least one more standard KGQA dataset and report on all question types, including rationale and handling strategy for each.
-Baseline coverage: compare with more advanced agentic/reasoning or retrieval-reranking KBQA systems in addition to classic baselines to better contextualize gains, and incorporate relevant non-LLM graph-centric baselines.
-Structural versus textual contributions: Provide detailed ablation studies demonstrating how structural features alone improve performance compared to text-only ranking. This should include stress tests where text similarity may be misleading, but graph evidence assists in disambiguation.
-Reproducibility package: release precomputed subgraphs, Graph2Text outputs, configuration files, seeds, trained rankers, and figure generation notebooks as a versioned archive with a stable DOI. Include a single reproducibility script to regenerate key tables and figures from the artifacts.
-Revise for typos and inconsistencies, enhance figure captions to highlight absolute improvements and include error bars where relevant, and streamline the contributions section to reduce overlap with the phrasing of prior work.