Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

Zhenwen Liang; Sidi Lu; Wenhao Yu; Kishan Panaganti; Yujun Zhou; Haitao Mi; Dong Yu

arXiv:2512.15687·cs.LG·December 18, 2025

Open Access

TL;DR

This paper introduces G2RL, a gradient-guided reinforcement learning method that aligns exploration with the model's own update geometry, significantly improving reasoning performance in large language models.

Contribution

G2RL is the first exploration mechanism that uses the model's own gradient geometry for reinforcement learning, enhancing reasoning abilities in LLMs.

Findings

01. G2RL outperforms entropy-based methods on multiple reasoning benchmarks.

02. It steers exploration toward trajectories whose gradient directions are more orthogonal or opposing.

03. G2RL maintains semantic coherence while expanding the exploration space.

Abstract

Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient…
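The within-group comparison described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the excerpt does not show how the sequence-level features are built or how they are scored, so the feature vectors below are toy stand-ins, and `group_novelty` (one minus the highest cosine similarity to any other response in the group) is a hypothetical scoring rule chosen only to show the idea of rewarding directions that differ from the rest of the group.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_novelty(features):
    """Hypothetical novelty score per response (an assumption, not from the paper):
    1 minus the highest cosine similarity to any other response in the group."""
    scores = []
    for i, f in enumerate(features):
        sims = [cosine(f, g) for j, g in enumerate(features) if j != i]
        scores.append(1.0 - max(sims) if sims else 0.0)
    return scores


# Toy stand-ins for per-response feature vectors within one sampled group.
group = [
    [1.0, 0.0],   # response A
    [2.0, 0.0],   # response B: same direction as A
    [0.0, 1.0],   # response C: orthogonal to A and B
    [-1.0, 0.0],  # response D: opposing A and B
]
scores = group_novelty(group)
```

Under this toy rule, responses A and B score low because each has a near-duplicate direction in the group, while C and D score high because no other response points the same way. How G2RL actually turns such comparisons into a training signal is not recoverable from the truncated abstract.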

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)