One-Way Policy Optimization for Self-Evolving LLMs

Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, and Li Yuan

arXiv:2605.22156·cs.LG·May 22, 2026

TL;DR

This paper introduces One-Way Policy Optimization (OWPO), a policy optimization method for self-evolving LLMs. By decoupling the update direction from the update magnitude, it stabilizes training, improves reasoning, and supports continuous self-improvement.

Contribution

OWPO separates the update direction from the update magnitude, and applies asymmetric reweighting and iterative reference updates to support stable, continuous self-evolution of LLMs.

Findings

01. OWPO outperforms baselines such as DAPO, OPD, and MOPD.

02. It enables continuous self-evolution without external references.

03. OWPO improves the reasoning capabilities of LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling the reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it…
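
The direction-flip issue described in the abstract can be illustrated with a toy, single-token sketch. Everything below is an assumption made for illustration: the function names, the symmetric log-ratio penalty, the coefficient `beta`, and in particular the positive weight in `one_way_signal`, which is a hypothetical placeholder and not OWPO's asymmetric reweighting rule. The sketch only shows that a penalty added to the reward can change the sign of the update, whereas a strictly positive multiplicative weight cannot.

```python
import math


def penalized_signal(advantage, logp_policy, logp_ref, beta):
    """Verifier advantage minus a symmetric token-level log-ratio penalty.

    Illustrative stand-in for a reference-policy constraint; not taken
    from the paper.
    """
    return advantage - beta * (logp_policy - logp_ref)


def one_way_signal(advantage, logp_policy, logp_ref, beta):
    """Sign is set by the verifier advantage; the reference only rescales.

    The weight below is a hypothetical placeholder chosen because it is
    always in (0, 1], so it can shrink the update but never flip it.
    """
    weight = math.exp(-beta * abs(logp_policy - logp_ref))
    return advantage * weight


# The verifier marks the response correct (positive advantage), and the
# policy already assigns the token far more probability than the reference.
adv, lp_policy, lp_ref, beta = 0.5, -0.2, -2.0, 0.5

flipped = penalized_signal(adv, lp_policy, lp_ref, beta)
kept = one_way_signal(adv, lp_policy, lp_ref, beta)

print(f"penalized signal: {flipped:+.3f}")  # negative: direction flipped
print(f"one-way signal:   {kept:+.3f}")     # positive: direction preserved
```

In this toy setting the penalty term (0.9) exceeds the positive advantage (0.5), so the penalized signal pushes against a verifier-approved token, while the multiplicative form keeps the verifier's sign and only reduces the step size.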
