One-Way Policy Optimization for Self-Evolving LLMs
Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, and Li Yuan

TL;DR
This paper introduces OWPO, a policy optimization method for self-evolving LLMs that decouples update direction from update magnitude, which stabilizes training, improves reasoning, and enables continuous self-improvement.
Contribution
OWPO separates the update direction, which is set by the verifier, from the update magnitude, which is adjusted by the reference policy. It applies asymmetric reweighting and iterative reference updates to promote stable, continuous self-evolution of LLMs.
Findings
OWPO outperforms baselines like DAPO, OPD, and MOPD.
It enables continuous self-evolution without external references.
OWPO improves the reasoning capabilities of LLMs.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling the reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it…
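The direction-flip issue described in the abstract can be illustrated with a toy per-token calculation. This is a minimal sketch under stated assumptions, not the paper's formulation: the function names, the `beta` coefficient, the log-ratio penalty form, and especially the placeholder weight `1 / (1 + beta * |log-ratio|)` are all hypothetical, since the abstract's description of the actual asymmetric reweighting is truncated. The sketch only shows that a positive magnitude weight cannot change the sign chosen by the verifier, whereas a subtractive reference penalty can.

```python
import math

def penalized_signal(advantage, logp, logp_ref, beta=0.5):
    """Verifier advantage minus a reference-relative log-ratio penalty.

    Hypothetical stand-in for a token-level constraint toward the reference.
    """
    return advantage - beta * (logp - logp_ref)

def one_way_signal(advantage, logp, logp_ref, beta=0.5):
    """Sign comes from the verifier only; the reference only scales magnitude.

    The weight below is a placeholder (always > 0), NOT the paper's reweighting.
    """
    weight = 1.0 / (1.0 + beta * abs(logp - logp_ref))
    return advantage * weight

# A verifier-approved token (advantage +1) to which the policy already assigns
# far more probability than the reference does.
logp, logp_ref = math.log(0.9), math.log(0.05)

flipped = penalized_signal(1.0, logp, logp_ref)   # negative: direction flipped
preserved = one_way_signal(1.0, logp, logp_ref)   # positive: direction kept
print(f"penalized: {flipped:+.3f}, one-way: {preserved:+.3f}")
```

In this toy case the penalized signal pushes down a token the verifier rewarded, while the one-way signal keeps the verifier's sign and only shrinks the step.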
