First Return, Entropy-Eliciting Explore

Tianyu Zheng; Tianshun Xing; Qingshui Gu; Taoran Liang; Xingwei Qu; Xin Zhou; Yizhi Li; Zhoufutu Wen; Chenghua Lin; Wenhao Huang; Qian Liu; Ge Zhang; and Zejun Ma

arXiv:2507.07017·cs.AI·July 10, 2025

Open Access · 3 Reviews

TL;DR

FR3E is a structured exploration framework for reinforcement learning in large language models that enhances reasoning stability, coherence, and correctness by targeting high-uncertainty decision points with semantically grounded feedback.

Contribution

FR3E introduces a novel exploration method that identifies high-uncertainty points and performs targeted rollouts to improve reasoning in LLMs without dense supervision.

Findings

01. Promotes more stable training in LLM reasoning.

02. Produces longer, more coherent responses.

03. Increases fully correct reasoning trajectories.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
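The two steps named in the abstract, locating high-uncertainty decision points and running targeted rollouts from them, can be illustrated with a minimal sketch. This is not the authors' implementation: the top-k selection rule, the Monte Carlo pass-rate estimate, and every name below (`token_entropy`, `top_k_uncertain_positions`, `estimate_prefix_value`, `rollout_fn`, `verify_fn`) are assumptions introduced here for illustration only.

```python
import math
from typing import Callable, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_uncertain_positions(
    step_probs: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k highest-entropy steps, returned in trajectory order.

    Assumption: uncertainty is measured by per-step token entropy and the
    k largest values are kept; the paper may use a different criterion.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def estimate_prefix_value(
    prefix: str,
    rollout_fn: Callable[[str], str],
    verify_fn: Callable[[str], bool],
    num_rollouts: int,
) -> float:
    """Fraction of rollouts continued from `prefix` that pass the verifier.

    `rollout_fn` and `verify_fn` are hypothetical stand-ins for a policy
    sampler and a verifiable-reward checker.
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    hits = sum(1 for _ in range(num_rollouts) if verify_fn(rollout_fn(prefix)))
    return hits / num_rollouts


# Toy trajectory: one next-token distribution per generation step.
toy_steps = [
    [1.0, 0.0],                # fully confident step
    [0.5, 0.5],                # uncertain step
    [0.9, 0.1],                # mildly uncertain step
    [0.25, 0.25, 0.25, 0.25],  # most uncertain step
]
branch_points = top_k_uncertain_positions(toy_steps, k=2)
```

In this toy setup, `branch_points` picks out the two steps with the flattest distributions. Each selected position would then serve as a prefix boundary from which `estimate_prefix_value` samples continuations, giving an intermediate signal derived only from outcome verification rather than from dense step-level labels.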

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Performance gains: FR3E demonstrates either superior or at least competitive performance compared to GRPO++, notably for general-purpose LLMs (Qwen2.5-7B, Qwen2.5-32B), with more modest gains for domain-specific models (Qwen2.5-Math-7B).
2. Improved training dynamics: FR3E shows notably higher and more stable entropy throughout training, visible in Figure 3, suggesting healthier exploration and avoidance of entropy collapse.
3. Fine-grained credit assignment: The adaptive advantage modulati

Weaknesses

### Method

1. The advantage calculation is insufficiently described in the main text. I have read Appendix C but am still somewhat confused. For trajectories that share the same prefix (*e.g.*, $P_{j,m}$, $P_{j,0}$) but receive different rewards, do they have different advantages on the shared tokens? For a single trajectory, are the advantages the same over all tokens in FR3E, or do they differ depending on the divided state?

### Experiments

2. Missing relevant hyperparameters: In Section 4, the

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* topic is relevant and timely
* clearly written
* reasonable evaluation:
  * models of three different sizes are tested
  * quite a few benchmarks tested

Weaknesses

* limited novelty: Besta et al. (Reasoning Language Models: A Blueprint, arXiv:2501.11223, Jan. 2025) already propose using entropy as a metric to identify decision points, as well as outcome-driven process-based rewards, although, to be fair, they only present their ideas without actually implementing and evaluating them.
* some aspects of the evaluation I expected are missing:
  * GRPO++ details are missing, including a discussion of why GRPO++ is a competitive baseline
  * 5.1: hyperparameter such a

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1) The proposed idea is simple and sound (targeted exploration at high-entropy states), achieving good overall performance on math tasks.
2) The authors have conducted further analyses of the training dynamics and the sources of gains, which aid understanding.
3) The writing is clear and the method is easy to follow.

Weaknesses

1) This work mainly concentrates on math tasks. Is it still effective on other tasks, such as more challenging agent-related scenarios with sparser reward signals?
2) Quite a few entropy-aware RL methods have appeared recently and could be mentioned in the related work (the differences should be discussed to highlight the contribution of this work).
3) The base reasoning trajectory is essential in FR3E. It is strongly suggested that the authors could give an in-depth discussi

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)