Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Tianyi Wang; Long Li; Hongcan Guo; Yibiao Chen; Yixia Li; Yong Wang; Yun Chen; Guanhua Chen

arXiv:2602.05717·cs.AI·February 6, 2026

TL;DR

Anchored Policy Optimization (APO) introduces a support coverage approach to reinforcement learning, preventing collapse and improving diversity and accuracy in policy optimization by balancing sharpening and restorative forces.

Contribution

APO shifts from shape matching to support coverage, providing a gradient-aligned method that enhances support coverage and enables elastic recovery to prevent policy collapse.

Findings

01. APO significantly improves Pass@1 accuracy.
02. APO restores diversity lost in standard policy gradients.
03. APO effectively prevents Recursive Space Contraction.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a…
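The contrast between Shape Matching and Support Coverage can be illustrated with a toy sketch. The abstract is truncated and does not define the Safe Manifold or the rectification term, so everything below is an assumption made for illustration only: the top-p rule in `high_confidence_support`, the hinge form of `support_coverage_penalty`, and the `top_p` and `floor` parameters are hypothetical and are not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def high_confidence_support(ref, top_p=0.9):
    """Hypothetical 'safe' set: the smallest group of options, taken in
    descending reference probability, whose total mass reaches top_p."""
    order = sorted(range(len(ref)), key=lambda i: ref[i], reverse=True)
    support, mass = [], 0.0
    for i in order:
        support.append(i)
        mass += ref[i]
        if mass >= top_p:
            break
    return support

def support_coverage_penalty(policy, ref, top_p=0.9, floor=0.01):
    """Hypothetical hinge penalty: zero unless the policy pushes an option
    from the reference's high-confidence set below `floor`."""
    support = high_confidence_support(ref, top_p)
    return sum(max(0.0, floor - policy[i]) for i in support)

# Toy distributions over four candidate continuations (made-up numbers).
ref = [0.5, 0.3, 0.15, 0.05]
sharpened = [0.9, 0.06, 0.03, 0.01]         # peaked, alternatives still sampleable
collapsed = [0.998, 0.001, 0.0005, 0.0005]  # alternatives have effectively vanished

support = high_confidence_support(ref)
kl_sharpened = kl_divergence(sharpened, ref)
cover_sharpened = support_coverage_penalty(sharpened, ref)
cover_collapsed = support_coverage_penalty(collapsed, ref)

print("support:", support)
print("KL penalty on sharpened policy:", kl_sharpened)
print("coverage penalty on sharpened policy:", cover_sharpened)
print("coverage penalty on collapsed policy:", cover_collapsed)
```

In this toy setting the KL term charges the sharpened policy simply for departing from the reference density, while the coverage-style term stays at zero until an option inside the reference's high-confidence set is close to unsampleable. That is the qualitative behavior the abstract attributes to a support-based constraint; the paper's actual objective may differ.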

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Multi-Objective Optimization Algorithms