PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
Nilin Abrahamsen

TL;DR
PROMA is a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Its accumulation variant gives tighter per-step KL control than GRPO with PPO clipping, and its intra-microbatch variant gives the best validation performance.
Contribution
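The note proposes two variants of PROMA. The accumulation-based variant projects the running gradient orthogonal to the sequence-wise log-probability gradients of each microbatch. The intra-microbatch variant applies a factored projection independently within each microbatch, which makes it compatible with standard data-parallel training.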
Findings
The accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping.
The intra-microbatch variant attains the best validation performance.
The intra-microbatch variant is compatible with standard data-parallel training.
Abstract
This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
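The accumulation-based variant can be illustrated with a small sketch. This is not the author's implementation: it assumes gradients are flat vectors, that the running gradient is projected before the current microbatch's policy gradient is added, and that the projection is an exact orthogonal projection computed by Gram-Schmidt. The names `project_out` and `accumulate_projected` are hypothetical, and the note's actual ordering and per-layer details may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(vec, directions, eps=1e-12):
    """Return `vec` minus its component in the span of `directions`."""
    basis = []
    for d in directions:
        # Gram-Schmidt: orthogonalize d against the basis built so far.
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip (near-)linearly dependent directions
            basis.append([ri / norm for ri in r])
    out = list(vec)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

def accumulate_projected(microbatches):
    """Accumulate policy gradients over microbatches with projection.

    Each item is (policy_grad, seq_logprob_grads): a flat gradient vector
    and a list of per-sequence log-probability gradient vectors.
    ASSUMPTION: project the running gradient first, then add the new one.
    """
    running = [0.0] * len(microbatches[0][0])
    for policy_grad, seq_logprob_grads in microbatches:
        running = project_out(running, seq_logprob_grads)
        running = [r + g for r, g in zip(running, policy_grad)]
    return running

# Toy usage with made-up 3-dimensional "gradients".
mb1 = ([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0]])
mb2 = ([0.5, 0.5, 0.5], [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
print(accumulate_projected([mb1, mb2]))  # [1.5, 0.5, 3.5]
```

In the toy run, the second microbatch's sequence gradients span the second coordinate axis, so the contribution accumulated from the first microbatch loses its second component before the new policy gradient is added. In this sketch, whatever was accumulated from earlier microbatches is orthogonal to the current microbatch's sequence-wise log-probability gradients at the moment that microbatch is processed.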
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
