ANO: A Principled Approach to Robust Policy Optimization
Yiheng Zhang, Yiming Wang, Kaiyan Zhao, Zhenglin Wan, Jiayu Chen, Leong Hou U

TL;DR
The paper introduces Anchored Neighborhood Optimization (ANO), a policy optimization method that replaces PPO's hard clipping with a smooth, redescending gradient mechanism, improving stability and performance in reinforcement learning and language model alignment.
Contribution
ANO is a novel policy optimization approach based on geometric principles, providing a robust alternative to existing methods like PPO and SPO.
Findings
ANO outperforms existing methods in MuJoCo and Atari control tasks.
ANO prevents policy collapse even at high learning rates.
In LLM alignment, ANO avoids catastrophic KL divergence explosions.
Abstract
Proximal Policy Optimization (PPO) dominates reinforcement learning and LLM alignment but relies on a "hard clipping" mechanism that discards valuable gradients. Conversely, unconstrained methods like SPO expose the optimization to unbounded updates, causing severe instability and policy collapse during extreme outlier encounters. To resolve this dilemma, we introduce a principled design space for policy optimization, demonstrating that a robust estimator must inherently suppress outliers while maintaining a smooth restoration force. Guided by these geometric principles, we derive Anchored Neighborhood Optimization (ANO), a novel method that seamlessly replaces hard clipping with a redescending gradient mechanism. Extensive evaluations demonstrate ANO's empirical superiority across diverse domains. In continuous (MuJoCo) and discrete (Atari) control, ANO establishes a robust…
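To make the contrast in the abstract concrete, the sketch below compares the gradient of PPO's clipped surrogate with respect to the probability ratio against a generic redescending influence function. The Welsch function is a textbook example from robust statistics and is used here purely as an assumption for illustration; the abstract does not state ANO's actual objective, and the names `ppo_clip_grad`, `welsch_influence`, and the scale `c` are hypothetical.

```python
import math


def ppo_clip_grad(ratio, advantage, eps=0.2):
    """Gradient w.r.t. `ratio` of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    Once the ratio leaves the trust region in the direction the advantage
    favors, the gradient drops to exactly zero (the "hard clipping").
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return advantage


def welsch_influence(u, c=0.2):
    """Welsch influence function psi(u) = u * exp(-(u / c)^2).

    Illustrative stand-in for a redescending mechanism (NOT ANO's formula):
    roughly linear near u = 0, then smoothly decaying toward zero for
    large |u|, so extreme outliers contribute almost nothing.
    """
    return u * math.exp(-((u / c) ** 2))


if __name__ == "__main__":
    advantage = 1.0
    for ratio in (1.0, 1.1, 1.3, 2.0, 5.0):
        deviation = ratio - 1.0  # distance from the anchor ratio of 1
        print(
            f"ratio={ratio:4.1f}  "
            f"ppo_grad={ppo_clip_grad(ratio, advantage):.3f}  "
            f"welsch={welsch_influence(deviation):.3e}"
        )
```

In this sketch the clipped gradient switches abruptly between the full advantage and zero, while the redescending function changes smoothly and still vanishes for large deviations. That is the qualitative behavior the abstract attributes to a robust estimator.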
