R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Weijie Shi; Yanxi Chen; Zexi Li; Xuchen Pan; Yuchang Sun; Jiajie Xu; Xiaofang Zhou; Yaliang Li

arXiv:2601.03715·cs.LG·January 8, 2026

TL;DR

R$^3$L is a reinforcement learning method for language models that improves both exploration and exploitation through language-guided error diagnosis, targeted credit assignment, and positive signal amplification, yielding 5-52% relative improvements over baselines.

Contribution

The paper presents R$^3$L, an approach that combines reflect-then-retry, pivotal credit assignment, and positive amplification to improve learning efficiency and stability in language model reinforcement learning.

Findings

1. Achieves 5-52% relative improvements over baselines.

2. Reduces rollout costs by restarting from failure points.

3. Maintains training stability despite off-policy data.

Abstract

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into…
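
The credit-assignment problem named in the abstract can be made concrete with a small sketch. It is not the paper's algorithm: the group-normalized advantage is an assumed GRPO-style baseline, and the `pivot` argument is a hypothetical mask that only illustrates what sparing a valid prefix could look like. The function names, rewards, and token counts are all invented for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Trajectory-level, group-normalized advantages (assumed GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mu) / sigma for r in rewards]


def token_advantages(advantage, num_tokens, pivot=None):
    """Spread one trajectory's advantage over its tokens.

    pivot=None: every token shares the same advantage (trajectory-level credit).
    pivot=k:    hypothetical mask; tokens before index k receive zero credit,
                so a valid prefix is not penalized for a later error.
    """
    start = 0 if pivot is None else pivot
    return [0.0 if t < start else advantage for t in range(num_tokens)]


# A failure-dominated group: one success among four rollouts.
rewards = [0.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# A failed 6-token rollout whose first 4 tokens are assumed valid.
coarse = token_advantages(adv[0], num_tokens=6)           # prefix is penalized
masked = token_advantages(adv[0], num_tokens=6, pivot=4)  # prefix is spared
```

With trajectory-level credit, all six tokens of the failed rollout receive the same negative advantage, including the four valid ones. In this group, three of the four trajectories contribute negative advantages and only one contributes a positive one, which is the imbalance the abstract describes as failure-dominated.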

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling