TL;DR
This paper introduces failure-prefix conditioning, a method that lets reasoning models keep learning from saturated problems by shifting exploration toward failure-prone reasoning states, improving performance without the cost of collecting harder problems.
Contribution
The paper proposes failure-prefix conditioning, a technique that conditions on prefixes of rare incorrect trajectories to recover learning signal from saturated reasoning problems in RLVR.
Findings
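Failure-prefix conditioning consistently improves performance on saturated problems, where standard RLVR stalls.
The method improves the model's ability to recover from misleading early reasoning, reducing the performance degradation seen when the model is conditioned on failure prefixes.
Iteratively refreshing the failure prefixes yields further gains after performance initially plateaus.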
Abstract
As Reinforcement Learning with Verifiable Rewards (RLVR) substantially improves the reasoning abilities of large language models (LLMs), a new bottleneck emerges: more training problems become saturated, that is, the LLM answers the questions correctly for nearly every rollout. On such problems, rewards provide little useful learning signal. While collecting harder problems is a natural response, it is costly and increasingly difficult. We propose failure-prefix conditioning, a simple method that unlocks the remaining signal in saturated problems by shifting exploration toward failure-prone reasoning states. By conditioning on prefixes of rare incorrect trajectories, the method improves the model's ability to recover from misleading early reasoning. We observe that failure-prefix conditioning consistently improves performance where standard RLVR stalls, and achieves gains comparable to…
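The prefix construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the `is_saturated` pass-rate threshold of 0.9, the `fraction` cut point, and whitespace-based truncation are all hypothetical choices made for the example. The abstract says only that prefixes come from rare incorrect trajectories on saturated problems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # full reasoning trajectory sampled from the model
    correct: bool  # verifiable reward: did the final answer check out?


def is_saturated(rollouts: List[Rollout], threshold: float = 0.9) -> bool:
    """Assumed criterion: the pass rate is at or above `threshold`."""
    if not rollouts:
        return False
    pass_rate = sum(r.correct for r in rollouts) / len(rollouts)
    return pass_rate >= threshold


def failure_prefix_prompts(
    question: str,
    rollouts: List[Rollout],
    fraction: float = 0.5,
    threshold: float = 0.9,
) -> List[str]:
    """Build new training prompts from the rare failures of a saturated problem.

    Each prompt is the question followed by the leading `fraction` of an
    incorrect trajectory, so later rollouts start from a failure-prone state.
    Truncating on whitespace is an assumption made for this sketch.
    """
    if not is_saturated(rollouts, threshold):
        return []  # not saturated: ordinary rollouts still carry signal
    prompts = []
    for rollout in rollouts:
        if rollout.correct:
            continue
        words = rollout.text.split()
        if not words:
            continue  # nothing to condition on
        cut = max(1, int(len(words) * fraction))
        prompts.append(question + "\n" + " ".join(words[:cut]))
    return prompts


# Usage: nine correct rollouts and one rare failure on the same question.
rollouts = [Rollout("2 + 2 = 4", True)] * 9 + [Rollout("2 + 2 is 5 so", False)]
prompts = failure_prefix_prompts("What is 2 + 2?", rollouts)
```

A fully solved problem (no incorrect rollouts) yields no prompts under this sketch, since there is no failure trajectory to take a prefix from.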
