Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space

Wang Zixian

arXiv:2601.12415·cs.LG·February 26, 2026

TL;DR

Orthogonalized Policy Optimization (OPO) formulates policy updates as orthogonal projections in a Hilbert space. The framework separates sampling geometry from optimization geometry and avoids gradient saturation when aligning large language models.

Contribution

This paper presents OPO, a new policy optimization method based on orthogonal projection in Hilbert space, decoupling sampling and optimization geometries for improved model training.

Findings

01. Prevents gradient saturation in high-confidence regimes.
02. Achieves stronger long-horizon rewards.
03. Improves out-of-distribution generalization.

Abstract

We propose Orthogonalized Policy Optimization (OPO), a principled framework for large language model alignment derived from optimization in the Hilbert function space L2(pi_k). Lifting policy updates from the probability simplex into L2(pi_k) transforms the nonlinear normalization constraint into a linear orthogonality condition <v, 1>_{pi_k} = 0 on the density fluctuation field v = pi/pi_k - 1. By the Hilbert projection theorem, the unique closed-form update is v_star = (omega_alpha - E[omega_alpha]) / mu, where the subtracted mean acts as a chemical potential enforcing probability conservation. This interpretation reveals advantage z-score normalization as a conservation-law projection rather than a variance-reduction heuristic. OPO cleanly decouples sampling geometry, controlled by the escort exponent alpha, from optimization geometry, governed by the stiffness parameter mu, a…
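The closed-form update quoted in the abstract can be checked numerically on a small discrete distribution. The sketch below is only an illustration under stated assumptions: the abstract is truncated and does not define omega_alpha, so `omega` here is an arbitrary per-outcome score, and the function name `opo_update` and the example numbers are hypothetical. It computes v_star = (omega - E[omega]) / mu with the expectation taken under pi_k, then recovers the updated policy from the definition v = pi/pi_k - 1.

```python
import numpy as np

def opo_update(pi_k, omega, mu):
    """Sketch of the projected update v_star = (omega - E_pi_k[omega]) / mu.

    pi_k  : current policy over a finite set of outcomes (sums to 1)
    omega : per-outcome score (stand-in for omega_alpha; an assumption)
    mu    : stiffness parameter, assumed positive
    """
    pi_k = np.asarray(pi_k, dtype=float)
    omega = np.asarray(omega, dtype=float)
    # Expectation under pi_k; subtracting it enforces <v, 1>_{pi_k} = 0.
    mean_omega = np.sum(pi_k * omega)
    v_star = (omega - mean_omega) / mu
    # Invert v = pi / pi_k - 1 to recover the updated distribution.
    pi_new = pi_k * (1.0 + v_star)
    return v_star, pi_new

pi_k = np.array([0.5, 0.3, 0.2])
omega = np.array([1.0, 0.0, -1.0])
v_star, pi_new = opo_update(pi_k, omega, mu=2.0)

print(np.sum(pi_k * v_star))  # inner product <v, 1>_{pi_k}, zero up to rounding
print(pi_new.sum())           # total probability, one up to rounding
```

Subtracting the mean is what keeps the total probability at one, which matches the abstract's reading of the mean as a conservation constraint. Nothing in this sketch keeps 1 + v_star non-negative, so a small mu can push individual entries of `pi_new` below zero; how the paper handles that case is not visible in the truncated abstract.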

Taxonomy

Topics: Machine Learning in Materials Science · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques