ISOPO: Proximal policy gradients without pi-old
Nilin Abrahamsen

TL;DR
ISOPO is an efficient single-step approximation of the natural policy gradient. In its simplest form it normalizes each sequence's log-probability gradient in the Fisher metric, whereas existing proximal methods such as GRPO and CISPO rely on multiple gradient steps with importance ratio clipping.
Contribution
The note proposes ISOPO, a method that approximates the natural policy gradient in a single gradient step, where prior proximal policy methods take multiple steps relative to a reference policy.
Findings
ISOPO achieves natural policy gradient approximation in one gradient step.
It applies its transformation layer-wise in a single backward pass, with negligible computational overhead compared to vanilla REINFORCE.
Unlike GRPO and CISPO, it does not need multiple gradient steps or importance ratio clipping.
Abstract
This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
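The simplest form described above (normalize each sample's log-probability gradient, then contract with the advantages) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes a toy softmax policy over three actions, treats each sampled action as a "sequence", and substitutes the Euclidean norm for the Fisher metric, which amounts to an identity Fisher matrix. The function names, the reward, and the mean-baseline advantages are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def per_sample_logprob_grads(theta, actions):
    """Gradient of log pi(a) w.r.t. the logits for each sampled action.

    For a softmax policy this is onehot(a) - pi, giving shape (N, K).
    """
    p = softmax(theta)
    return np.eye(theta.size)[actions] - p

def reinforce_update(grads, advantages):
    """Vanilla REINFORCE direction: advantage-weighted mean of raw gradients."""
    return advantages @ grads / len(advantages)

def normalized_update(grads, advantages, eps=1e-8):
    """Normalize each per-sample gradient, then contract with the advantages.

    ASSUMPTION: the Euclidean norm stands in for the Fisher-metric norm.
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / (norms + eps)
    return advantages @ unit / len(advantages)

rng = np.random.default_rng(0)
theta = np.array([2.0, 0.0, -1.0])              # toy policy logits
actions = rng.choice(3, size=8, p=softmax(theta))
rewards = (actions == 2).astype(float)          # hypothetical reward
advantages = rewards - rewards.mean()           # simple mean baseline

grads = per_sample_logprob_grads(theta, actions)
print("REINFORCE :", reinforce_update(grads, advantages))
print("normalized:", normalized_update(grads, advantages))
```

In this sketch every sample contributes a unit-length direction scaled only by its advantage, so rare actions with large raw gradients no longer dominate the update. A faithful version would replace the Euclidean norm with the per-layer Fisher-metric norm that the abstract refers to.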
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing
