Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Yanwei Ren; Haotian Zhang; Likang Xiao; Xikai Zhang; Jiaxing Huang; Jiayan Qiu; Baosheng Yu; Quan Chen; Liu Liu

arXiv:2602.24110 · cs.AI · March 2, 2026

TL;DR

This paper introduces SCOPE, a framework that improves reinforcement learning from verifiable rewards by salvaging partially correct trajectories through step-wise off-policy correction, enhancing exploration and accuracy.

Contribution

SCOPE leverages Process Reward Models to identify and correct the first error in rollouts, maintaining diversity and achieving state-of-the-art results in reasoning tasks.
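
The idea of locating the first error and correcting only that step can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the score threshold of 0.5, and the `correct_step` callback (standing in for an off-policy source of a corrected step) are all hypothetical, and the sketch stops at the corrected step because the summary does not say how the rollout continues afterwards.

```python
from typing import Callable, List, Optional, Sequence


def first_error_index(step_scores: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`.

    `step_scores` are assumed to be per-step correctness scores from a
    Process Reward Model; the threshold value is an illustrative assumption.
    Returns None when no step falls below the threshold.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def salvage_rollout(steps: Sequence[str],
                    step_scores: Sequence[float],
                    correct_step: Callable[[List[str]], str],
                    threshold: float = 0.5) -> List[str]:
    """Keep the on-policy prefix before the first error and replace that step.

    `correct_step` is a hypothetical callback that receives the retained
    prefix and returns a corrected step (for example, from a stronger model).
    """
    idx = first_error_index(step_scores, threshold)
    if idx is None:
        # No step was flagged, so the rollout is returned unchanged.
        return list(steps)
    prefix = list(steps[:idx])
    return prefix + [correct_step(prefix)]
```

In this sketch the steps before the first flagged error stay exactly as the policy generated them, which is what distinguishes step-wise correction from replacing the whole trajectory.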

Findings

01. Increased diversity score by 13.5%

02. Achieved 46.6% accuracy on math reasoning

03. Demonstrated robust out-of-distribution generalization

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to several missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, which degrades rollout diversity and prematurely narrows the exploration space. Although Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods attempt to introduce off-policy guided whole-trajectory replacement that is often outside the policy model's distribution, but still…

Taxonomy

Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics