Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
Yuning Wu, Ke Wang, Devin Chen, Kai Wei

TL;DR
HAPO introduces a hindsight mechanism with a self-paced curriculum to improve reinforcement learning in sparse reward environments, ensuring unbiased policy updates and overcoming limitations of static teacher guidance.
Contribution
HAPO's synthetic success injection and gating mechanism provide a novel, theoretically grounded approach for stable, unbiased policy optimization in challenging sparse reward settings.
Findings
HAPO achieves asymptotic consistency and unbiased gradients.
The hindsight mechanism improves learning efficiency in sparse-reward settings.
The method overcomes the limitations of static teacher forcing.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by…
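The abstract's two central ideas, advantage collapse and gated injection of a teacher success, can be illustrated with a minimal sketch. It is not the authors' implementation: the truncated abstract does not specify the SSI operator or the gate, so `BetaGate`, `maybe_inject`, the Beta(1, 1) prior, and the rule "inject with probability one minus a sampled success rate" are all hypothetical choices made here for illustration. Only the group-standardized advantage follows the usual GRPO formulation.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class BetaGate:
    """Hypothetical gate: a Beta posterior over the policy's success rate."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) is a uniform prior; this choice is an assumption.
        self.alpha = alpha
        self.beta = beta

    def update(self, rewards):
        # Count rollouts with positive reward as successes.
        successes = sum(1 for r in rewards if r > 0)
        self.alpha += successes
        self.beta += len(rewards) - successes

    def should_inject(self, rng):
        # Thompson-style draw: sample a plausible success rate and
        # inject more often when that sampled rate is low.
        sampled_rate = rng.betavariate(self.alpha, self.beta)
        return rng.random() > sampled_rate


def maybe_inject(rewards, gate, rng, teacher_reward=1.0):
    """Assumed stand-in for SSI: if every rollout failed and the gate fires,
    swap one rollout for a teacher demonstration carrying teacher_reward."""
    gate.update(rewards)
    if max(rewards) > 0 or not gate.should_inject(rng):
        return list(rewards), False
    patched = list(rewards)
    patched[0] = teacher_reward  # slot 0 now represents the teacher trajectory
    return patched, True


rng = random.Random(0)
gate = BetaGate()

# Every rollout failed: all advantages are zero, so the gradient vanishes.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One success in the group restores a non-zero learning signal.
anchored = group_advantages([1.0, 0.0, 0.0, 0.0])

# A group that already contains a success is never modified.
kept, injected_on_success = maybe_inject([1.0, 0.0, 0.0, 0.0], gate, rng)

# An all-failure group may or may not be patched, depending on the gate.
patched, injected_on_failure = maybe_inject([0.0, 0.0, 0.0, 0.0], gate, rng)
```

Under these assumptions, the gate fires less often as the posterior success rate rises, which is one way to read the "self-paced curriculum" the abstract describes: teacher anchoring fades once the policy starts succeeding on its own.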
