Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian

TL;DR
Orthogonalized Policy Optimization (OPO) introduces a novel Hilbert space framework for policy updates, enabling better control over optimization geometry and improving large language model alignment by avoiding gradient saturation.
Contribution
This paper presents OPO, a new policy optimization method based on orthogonal projection in Hilbert space, decoupling sampling and optimization geometries for improved model training.
Findings
Prevents gradient saturation in high-confidence regimes.
Achieves stronger long-horizon rewards.
Improves out-of-distribution generalization.
Abstract
We propose Orthogonalized Policy Optimization (OPO), a principled framework for large language model alignment derived from optimization in the Hilbert function space L2(pi_k). Lifting policy updates from the probability simplex into L2(pi_k) transforms the nonlinear normalization constraint into a linear orthogonality condition <v, 1>_{pi_k} = 0 on the density fluctuation field v = pi/pi_k - 1. By the Hilbert projection theorem, the unique closed-form update is v_star = (omega_alpha - E[omega_alpha]) / mu, where the subtracted mean acts as a chemical potential enforcing probability conservation. This interpretation reveals advantage z-score normalization as a conservation-law projection rather than a variance-reduction heuristic. OPO cleanly decouples sampling geometry, controlled by the escort exponent alpha, from optimization geometry, governed by the stiffness parameter mu, a…
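The closed-form update quoted in the abstract can be illustrated on a small discrete distribution. The following is a minimal sketch, not the paper's implementation. It assumes the expectation E[omega_alpha] is taken under pi_k (consistent with the inner product <v, 1>_{pi_k}), and the function name opo_update, the example values of omega, and the choice mu = 4.0 are all hypothetical. It shows that subtracting the mean makes v_star orthogonal to the constant function 1, so the updated policy pi_k * (1 + v_star) still sums to one.

```python
import numpy as np

def opo_update(pi_k, omega, mu):
    """Sketch of v_star = (omega - E[omega]) / mu on a finite action set.

    Assumption: the expectation is taken under the reference policy pi_k.
    """
    pi_k = np.asarray(pi_k, dtype=float)
    omega = np.asarray(omega, dtype=float)
    # Mean of omega under pi_k (the "chemical potential" term in the abstract).
    mean_omega = np.dot(pi_k, omega)
    # Centered, rescaled fluctuation field v = pi / pi_k - 1.
    v_star = (omega - mean_omega) / mu
    # Recover the updated policy from the fluctuation field.
    pi_new = pi_k * (1.0 + v_star)
    return v_star, pi_new

# Hypothetical example values, chosen only for illustration.
pi_k = np.array([0.5, 0.3, 0.2])
omega = np.array([1.0, -0.5, 0.25])
v_star, pi_new = opo_update(pi_k, omega, mu=4.0)

print("orthogonality <v, 1>_{pi_k}:", np.dot(pi_k, v_star))
print("sum of updated policy:", pi_new.sum())
```

Centering guarantees normalization but not nonnegativity: pi_new stays a valid distribution only while every entry of v_star is at least -1, which in this sketch depends on mu being large enough relative to the spread of omega.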
Taxonomy
Topics: Machine Learning in Materials Science · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques
