Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Chenxi Liu; Junjie Liang; Yuqi Jia; Bochuan Cao; Yang Bai; Heng Huang; Xun Chen

arXiv:2511.04800·cs.CL·November 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ERPO, a framework that enhances reinforcement learning for language models by exploring residual prompts, leading to improved reasoning performance and training diversity.

Contribution

ERPO actively explores residual prompts with zero reward variance, reactivating their training signals and improving reasoning abilities of language models.
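
To see why a prompt with zero reward variance contributes no training signal, here is a minimal sketch of a GRPO-style group-relative advantage. The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_residual(rewards):
    """Assumed check: every sampled response received the same reward."""
    return len(set(rewards)) == 1


mixed = [1.0, 0.0, 1.0, 0.0]   # some responses correct, some not
solved = [1.0, 1.0, 1.0, 1.0]  # every response correct

mixed_adv = group_relative_advantages(mixed)
solved_adv = group_relative_advantages(solved)
```

For `solved`, every reward equals the group mean, so every advantage is zero and the policy gradient for that prompt vanishes. For `mixed`, the advantages are close to +1 and -1.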

Findings

1. ERPO outperforms strong baselines on mathematical reasoning benchmarks.
2. ERPO increases the diversity of reasoning traces during training.
3. ERPO revives training signals from residual prompts.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced…
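
The per-prompt history tracker could look roughly like the sketch below. The exact trigger condition and update rule are not given in the text above, so everything here is a hypothetical illustration: the class name `TemperatureTracker`, the fixed `step`, the `t_max` cap, and the reset to `base_temp` once a prompt yields mixed rewards are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TemperatureTracker:
    """Hypothetical per-prompt sampling-temperature history."""

    base_temp: float = 1.0  # assumed default sampling temperature
    step: float = 0.1       # assumed increment per residual observation
    t_max: float = 1.5      # assumed upper bound on temperature
    temps: dict = field(default_factory=dict)

    def temperature(self, prompt_id):
        """Temperature to use the next time this prompt is sampled."""
        return self.temps.get(prompt_id, self.base_temp)

    def update(self, prompt_id, rewards):
        """Record the latest group of rewards for a prompt."""
        current = self.temperature(prompt_id)
        if len(set(rewards)) == 1:
            # Zero reward variance: explore more next time, up to the cap.
            new_temp = min(current + self.step, self.t_max)
        else:
            # Mixed rewards already give a signal: assumed reset to base.
            new_temp = self.base_temp
        self.temps[prompt_id] = new_temp
        return new_temp


tracker = TemperatureTracker()
first = tracker.update("p1", [1.0, 1.0, 1.0, 1.0])
for _ in range(10):
    capped = tracker.update("p1", [1.0, 1.0, 1.0, 1.0])
reset = tracker.update("p1", [1.0, 0.0, 1.0, 1.0])
```

Keeping the state in a dictionary keyed by prompt ID means prompts that have never been residual cost nothing, and the cap keeps repeated increases bounded.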

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is easy to follow, and the method is clean and simple. Increasing the temperature can prevent RA from overfitting too quickly.
2. The method performs well on Qwen2.5 3B/7B on math tasks compared with DAPO.

Weaknesses

1. The paper trains only on Qwen2.5 models and does not try other models such as Llama, OpenThinker, or Octothinker. Given the spurious reward results [1], it may be better to evaluate on non-Qwen2.5 models.
2. I am concerned about whether the method generalizes when scaling model size and compute. The paper trains for only 175 steps, and the curves do not appear to have converged. When the temperature reaches T_max, keeping RA may make the model overfit to some correct outputs. It's also kind of tricky to tune the

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper clearly identifies a practical limitation in current GRPO-family algorithms where residual prompts accumulate and reduce training diversity. Table 1 provides compelling evidence that this problem worsens with larger models and longer training.
2. The proposed ERPO framework is straightforward to implement and can be easily integrated into existing RLVR algorithms.

Weaknesses

1. Recent work has shown that in RLVR, even when models achieve nearly 100% accuracy on the training set, continued training can still improve performance on validation and test sets. This phenomenon, known as grokking, is also observable in reward curves and validation accuracy curves. From this perspective, the problem this paper raises may not be as significant as claimed.
2. The paper primarily conducts experiments based on DAPO. It is missing some RLVR algorithms, including vanilla GRPO, GPG, RLOO, REINFORCE++,

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The idea of increasing temperature during training is novel, simple, and easy to implement. The paper is also well written and easy to follow.

Weaknesses

I have two major concerns with the numbers reported in the paper, specifically for the MATH500 dataset.
1. **Underperforming baselines:** The paper reports mean@4 of 50.4% for the 3B model and 60.3% for the 7B model using DAPO. However, these numbers fall well below those of the Qwen2.5 3B and 7B instruction-tuned models, at 65.9% and 75.5%. These numbers are from Table 8 and Table 9 of the official report of the Qwen Team [1]. Why is there such a large discrepancy

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics