Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

Jingtong Gao; Ling Pan; Yejing Wang; Rui Zhong; Chi Lu; Maolin Wang; Qingpeng Cai; Peng Jiang; Xiangyu Zhao

arXiv:2505.17621 · cs.LG · February 2, 2026

Open Access 3 Reviews

TL;DR

This paper introduces IMAGINE, a reinforcement learning method that enhances large language model reasoning by providing dense, exploration-driven rewards, leading to significant performance improvements on complex tasks.

Contribution

IMAGINE offers novel reward mechanisms and integration techniques to improve exploration and reasoning in large language models, addressing limitations of sparse reward signals.

Findings

01 Improves reasoning performance by 22.23% on AIME 2024.

02 Provides dense, exploration-focused rewards for better training.

03 Enhances exploration efficiency on difficult samples.

Abstract

Reinforcement Learning (RL) has become a key approach for enhancing the reasoning capabilities of large language models. However, prevalent RL approaches like proximal policy optimization and group relative policy optimization suffer from sparse, outcome-based rewards and weak exploration incentives, limiting their effectiveness. Specifically, sparse rewards offer limited feedback, especially on difficult problems, and introduce biases favoring familiar trajectories over novel reasoning paths. These issues critically undermine performance on complex tasks that inherently require iterative reasoning. To overcome these challenges, we propose Intrinsic MotivAtion Guided exploratIoN for Enhanced reasoning (IMAGINE), which delivers dense rewards and encourages exploration. IMAGINE introduces three innovations: a trajectory-aware exploration reward that reduces token-level bias efficiently;…
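The reviews below refer to Random Network Distillation (RND) as the source of the exploration reward. As a rough illustration of that idea only, the following sketch scores a trajectory by how badly a trainable predictor matches a frozen random network: familiar inputs earn a shrinking bonus, unfamiliar ones keep a large one. Everything here is an assumption made for illustration, not the authors' implementation: the class name `RNDBonus`, the fixed-size trajectory feature vectors, the linear predictor, and the `shaped_reward` mixing rule with its `beta` weight are all hypothetical.

```python
import numpy as np


class RNDBonus:
    """Toy Random Network Distillation (RND) novelty bonus.

    Assumption: each reasoning trajectory has already been encoded as a
    fixed-size feature vector; that encoding is outside this sketch.
    """

    def __init__(self, dim, out=16, lr=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen random target network: never updated after initialisation.
        self.w_target = rng.normal(size=(dim, out))
        # Trainable linear predictor, started at zero.
        self.w_pred = np.zeros((dim, out))
        self.lr = lr

    @staticmethod
    def _unit(x):
        # Scale each row to unit L2 norm so the gradient step stays stable.
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.maximum(norms, 1e-8)

    def _error(self, x):
        # Predictor output minus frozen target output, per feature row.
        return x @ self.w_pred - np.tanh(x @ self.w_target)

    def bonus(self, feats):
        """Per-trajectory intrinsic reward: mean squared prediction error."""
        x = self._unit(np.asarray(feats, dtype=float))
        return np.mean(self._error(x) ** 2, axis=1)

    def update(self, feats):
        """One gradient-descent step on the predictor's mean squared error."""
        x = self._unit(np.asarray(feats, dtype=float))
        err = self._error(x)
        grad = 2.0 * x.T @ err / err.size
        self.w_pred -= self.lr * grad


def shaped_reward(outcome, bonus, beta=0.1):
    # Hypothetical mixing rule: sparse outcome reward plus a scaled bonus.
    return outcome + beta * bonus
```

Calling `update` repeatedly on the features of one trajectory drives its `bonus` toward zero, while a trajectory with orthogonal features keeps its initial bonus, so a failed but unfamiliar attempt still receives a non-zero training signal through `shaped_reward`.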

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper clearly identifies the problem of sparse rewards in current RL methods: current RL algorithms generally adopt an outcome reward but overlook the differences among important segments or tokens.
2. This paper evaluates i-MENTOR on 2 models from different families, 2 mainstream RL algorithms, and 4 benchmarks, which is comprehensive enough to reveal the performance improvement from using i-MENTOR.

Weaknesses

1. This paper lacks analysis of the problem it aims to address: sparse rewards and inadequate exploration. It would be better for the authors to show some empirical results and analysis to demonstrate the impact of these methods on the optimization performance.
2. The use of RND (Random Network Distillation) is heuristic. It would be better to discuss in depth the selection of RND as the process reward model.
3. The experiments show the results of two models with different sizes, which is g

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The proposed i-MENTOR method successfully demonstrates improved performance over both baselines (PPO and GRPO) across multiple reasoning datasets.
2. The paper provides a clear and well-articulated motivation for incorporating each technical component, making the design choices behind i-MENTOR easy to follow.

Weaknesses

The paper's core claim rests on enhancing exploration quality, yet it fails to provide compelling empirical evidence or quantitative metrics that directly confirm the proposed method effectively increases or improves the quality of exploration compared to the baselines.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The motivation for incorporating RND-based intrinsic rewards to encourage exploration is clear.
- The presentation is well-structured, and the empirical results are clearly organized and easy to follow.

Weaknesses

1. The experimental settings are rather limited. Both GSM8K and Countdown are relatively simple benchmarks, lacking evaluations on more challenging or large-scale reasoning tasks.
2. The paper omits critical implementation details of RND. For example, it is unclear what the input to the predictor and target networks is, or how sequences are represented.
3. The idea of using RND for intrinsic exploration reward is not particularly novel, and most of the other design elements appear to be practi

Taxonomy

Topics: Software Engineering Research · Artificial Intelligence in Law