Review Comment:
The paper presents a reranking framework for Knowledge Graph Question Answering (KGQA) that identifies the shortest-path subgraphs in Wikidata connecting question entities to potential answer candidates. It extracts various features, including graph, text, and Graph2Text features, and trains rankers using logistic and linear regression, CatBoost, and an MPNet-based sequence ranker. This approach aims to improve the Hits@N metric over candidates generated by large language models on the Mintaka dataset. Additionally, the paper includes a web application for visualizing subgraphs and providing graph-to-text explanations. It also shares code and pointers to the datasets and models, with the goal of enhancing the interpretability and reproducibility of the entire pipeline.
The paper is readable and mostly well-structured, but there are recurring typos and phrasing issues (e.g., “reprehensibility” vs “reproducibility,” inconsistent capitalization, minor grammatical errors) that should be corrected for clarity and polish in a journal version. Figures and tables are informative, although some legends and axis labels could be improved to better emphasize absolute and relative gains, thereby guiding readers to the key takeaways more quickly.
The pipeline consists of candidate generation through diverse beam search with fine-tuned large language models (LLMs), subgraph extraction from a local Wikidata dump with igraph, and feature engineering covering graph statistics, text concatenations, and Graph2Text variants. Rankers are then trained and evaluated using Hits@N.
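For reference, my reading of the subgraph-extraction step corresponds to something like the python-igraph sketch below; the graph handle `kg` and the entity/candidate vertex IDs are hypothetical placeholders, not the authors' actual interface.

```python
import igraph as ig

def shortest_path_subgraph(kg: ig.Graph, question_entities, candidate):
    """Induce the subgraph spanned by all shortest paths from each
    question entity to an answer candidate (edge direction ignored)."""
    vertices = set()
    for ent in question_entities:
        # get_all_shortest_paths returns lists of vertex ids, one per path
        for path in kg.get_all_shortest_paths(ent, to=candidate, mode="all"):
            vertices.update(path)
    return kg.induced_subgraph(sorted(vertices))
```

If this matches the implementation, stating it explicitly (including whether edge direction is ignored) would help readers reproduce the extraction step.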
It is notable that text and Graph2Text signals are the most important features, with PageRank being a key graph feature. This suggests that textual data is more influential than structural features, underscoring the need for further study of the role of subgraph structure relative to text alone, as well as testing on out-of-domain KGQA benchmarks.
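To make this feature-importance analysis easier to extend, per-family importances can be aggregated directly from a fitted CatBoost model; the prefix-based feature naming below ("graph_", "text_", "g2t_") is my assumption, not the authors' convention.

```python
from collections import defaultdict

def importance_by_family(model, feature_names):
    """Sum CatBoost feature importances per feature family, assuming
    names like 'graph_pagerank', 'text_sim', 'g2t_overlap'."""
    totals = defaultdict(float)
    for name, score in zip(feature_names, model.get_feature_importance()):
        totals[name.split("_", 1)[0]] += score
    return dict(totals)
```

Reporting such family-level totals next to the per-feature scores would make the text-versus-structure comparison more direct.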
Strengths
+Clear and modular pipeline featuring subgraph mining, feature extraction, and multiple rankers, allowing for careful comparisons and analyses of feature importance across candidate LLMs.
+Consistent empirical progress on Mintaka, with thorough reporting across feature types and models, along with a practical web tool to visualize subgraphs and provide graph-to-text explanations.
+Insightful observations regarding feature importance, such as the contribution of PageRank and the prominence of text/Graph2Text signals, can guide future hybrid designs.
Weaknesses and required revisions
-Limited evaluation scope: the evaluation focuses primarily on Mintaka and excludes yes/no and count question types. Add at least one more standard KGQA dataset and report on all question types, with a rationale and handling strategy for each.
-Baseline coverage: compare with more advanced agentic/reasoning or retrieval-reranking KBQA systems in addition to classic baselines to better contextualize gains, and incorporate relevant non-LLM graph-centric baselines.
-Structural versus textual contributions: provide detailed ablation studies demonstrating whether structural features improve performance over text-only ranking. This should include stress tests where text similarity is misleading but graph evidence aids disambiguation (see the sketch after this list).
-Reproducibility package: release precomputed subgraphs, Graph2Text outputs, configuration files, seeds, trained rankers, and figure generation notebooks as a versioned archive with a stable DOI. Include a single reproducibility script to regenerate key tables and figures from the artifacts.
-Presentation: revise for typos and inconsistencies, enhance figure captions to highlight absolute improvements and include error bars where relevant, and streamline the contributions section to reduce overlap with the phrasing of prior work.
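To make the ablation requested above concrete, here is a minimal sketch assuming precomputed feature blocks; X_text, X_graph, the labels y, and hard_mask (a boolean array marking examples where text similarity is misleading) are hypothetical placeholders, not artifacts from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablate(X_text, X_graph, y, hard_mask):
    """Train on the easy slice, score on the adversarial slice,
    once with text features alone and once with graph features added."""
    feature_sets = {"text_only": X_text,
                    "text_plus_graph": np.hstack([X_text, X_graph])}
    results = {}
    for name, X in feature_sets.items():
        clf = LogisticRegression(max_iter=1000).fit(X[~hard_mask], y[~hard_mask])
        results[name] = clf.score(X[hard_mask], y[hard_mask])
    return results
```

Reporting Hits@1 (or accuracy, as here) on such a slice alongside the full test set would directly substantiate the claim about structural contributions.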