Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang, Shijie Zhang, Wei Qiang Zhang, Yongfeng Huang, Haixin Duan, Yunqi Zhang

TL;DR
Explore-on-Graph (EoG) is a new framework that uses reinforcement learning to enable large language models to autonomously explore knowledge graphs, improving reasoning diversity and accuracy in question-answering tasks.
Contribution
EoG introduces a reinforcement learning approach with path-based rewards to promote autonomous exploration of knowledge graphs by LLMs, surpassing prior constrained methods.
Findings
Achieves state-of-the-art results on five KGQA benchmarks.
Outperforms both open-source and closed-source LLMs.
Enhances reasoning diversity and correctness through exploration.
Abstract
The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrained LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confined the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce…
Peer Reviews
Decision: ICLR 2026 Poster
* **Clear motivation & problem framing.** The paper articulates why rule/imitation approaches struggle on OOD patterns and positions exploration as the missing capability. Figure 1 illustrates this vividly.
* **Method is simple, modular, and reproducible in principle.** The rewards (answer F1; path triple-match ratio) are transparent and plug into a standard GRPO objective with KL control.
* **Strong empirical results across diverse KGQA datasets** with consistent gains vs. strong baselines;
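The second strength above notes that the rewards plug into a standard GRPO objective. As a rough illustration of what that usually involves, the sketch below computes group-relative advantages by normalizing each sampled response's scalar reward against the other responses drawn for the same question. This is not the authors' implementation: the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    `rewards` holds one scalar reward per response sampled for the same
    question. `eps` is an assumed constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to a single question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so a group in which every response earns the same reward contributes no learning signal.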
1. **Potential reward gaming / verification gap.** The **path reward** credits substring co-occurrence of `(subject, relation, object)` tokens in `<think>` text rather than **verified KG traversals**. This leaves room for *verbalization without execution* (i.e., asserting triples to earn reward). The paper should either (a) execute the predicted path against the KG to produce a structural match reward, or (b) at least audit hallucinated triples vs. KG edges.
2. **LLM-judge reliance for qualitat
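To make the first concern concrete, here is a minimal sketch of the two rewards as the reviews describe them (answer F1, and a path triple-match ratio based on substring co-occurrence in the `<think>` text), together with the kind of audit suggested in option (b). It is an illustration under assumptions, not the paper's code: every function name, the lowercasing, the `</think>` closing tag, and the toy triples are hypothetical.

```python
import re


def answer_f1(predicted, gold):
    """Set-level F1 between predicted and gold answers (case-insensitive)."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def extract_think(response):
    """Return the text inside the first <think>...</think> span, or ''."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    return match.group(1) if match else ""


def path_match_ratio(response, reference_triples):
    """Fraction of reference triples whose subject, relation, and object
    all occur somewhere in the think text. Order and adjacency are NOT
    checked, which is the gap the review points out."""
    text = extract_think(response).lower()
    if not reference_triples:
        return 0.0
    hits = sum(
        all(part.lower() in text for part in triple)
        for triple in reference_triples
    )
    return hits / len(reference_triples)


def audit_triples(asserted_triples, kg_edges):
    """Return asserted triples that do not exist as edges in the KG."""
    return [t for t in asserted_triples if t not in kg_edges]


# Toy example with hypothetical entities and relations.
reference = [("Barack Obama", "born_in", "Honolulu")]
gamed = "<think>Honolulu, born_in, Barack Obama</think> Answer: Honolulu"
print(path_match_ratio(gamed, reference))
```

In this toy case the scrambled mention still earns full path credit because only co-occurrence is tested. An audit such as `audit_triples`, run over triples parsed from the response, would instead flag any asserted triple that is absent from the graph.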
1. The motivation is well grounded. Not only the answer but also the reasoning path can serve as a good reward signal.
2. The experiments show strong results. For example, Table 1 shows that EoG outperforms not only open-source but also closed-source LLMs, and Table 4 shows strong results in OOD settings. The improvement is significant.
1. Some implementation details are unclear. Are Phase 1 (Outcome Reward Modeling) and Phase 2 (Path-refined Reward Modeling) implemented sequentially or simultaneously (as in Equation 5)?
2. Reproducibility: The code is currently unavailable, which hinders verification, reproduction, and follow-up improvements. Open-source code is crucial for all three.
1. This method outperforms powerful closed-source models such as GPT-5 and Gemini-2.5 Pro, which is impressive.
2. The paper is well written, with clear motivation and problem formulation, and provides a good example (Figure 1) illustrating the limitations of existing approaches and the EoG method.
3. A smaller open-source LLM trained with EoG can compete with larger closed-source ones, which means EoG addresses some current compute resource limitations.
4. Well-organized experiment, ablation
1. Limited technical novelty: The core contribution combines existing techniques (SFT, GRPO, simple reward design) without significant algorithmic innovation. The path reward is particularly simplistic, using only substring matching. In KG reasoning, using reinforcement learning to explore paths is a general and common practice; see MINERVA [1] and DeepPath [2].
2. The reliance on Gemini 2.5 Flash for dataset generation creates a dependency that may limit reproducibility. The paper
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
