TL;DR
ExPO is a modular reinforcement learning framework that improves complex reasoning by generating guided positive samples, which raises learning efficiency and performance on challenging benchmarks.
Contribution
The paper proposes Self-Explanation Policy Optimization (ExPO), a method that generates effective positive samples for RL training. The authors report that these samples work better than expert demonstrations on reasoning tasks.
Findings
ExPO improves reasoning performance on benchmarks.
ExPO enhances learning efficiency in complex tasks.
ExPO outperforms expert-demonstration methods in challenging settings.
Abstract
Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the model already knows (distribution-sharpening) rather than enabling the model to solve problems where it initially generates no correct solutions. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood…
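The dependence on positive samples described in the abstract can be illustrated with a minimal sketch. It assumes binary correctness rewards and the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation). The function name `group_relative_advantages` and the `eps` constant are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its sampling group.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A hard problem: none of the sampled solutions is correct.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single positive sample is enough to produce a non-zero signal.
one_right = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(all_wrong)
print(one_right)
```

When every sample in the group is wrong, all advantages are zero and the policy gradient carries no information for that problem. Once one positive sample appears, it receives a positive advantage and the incorrect samples receive negative ones.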