PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates

Nilin Abrahamsen

arXiv:2601.10498 · cs.LG · February 18, 2026

Open Access

TL;DR

PROMA is a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Its accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, and its intra-microbatch variant achieves the best validation performance.

Contribution

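The paper proposes two variants of PROMA. The accumulation-based variant projects the running gradient orthogonal to the sequence-wise log-probability gradients of each microbatch. The intra-microbatch variant applies a factored projection, built from dominant subspaces of activations and gradient outputs, independently within each microbatch, which makes it compatible with standard data-parallel training.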

Findings

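01. The accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping.

02. The intra-microbatch variant attains the best validation performance.

03. The intra-microbatch variant is compatible with standard data-parallel training, because its projection is applied independently within each microbatch.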

Abstract

This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
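
The accumulation-based variant can be illustrated with a small sketch. This is one possible reading of the abstract, not the paper's implementation: gradients are flattened to plain Python lists, the helper names (`project_out`, `accumulate_projected`) are hypothetical, and the order of operations (project the running gradient against the current microbatch's sequence-wise log-probability gradients, then add that microbatch's policy gradient) is an assumption.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt; drops (near-)linearly dependent vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(g, vectors):
    """Return the component of g orthogonal to span(vectors)."""
    out = list(g)
    for q in orthonormal_basis(vectors):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


def accumulate_projected(microbatches):
    """Accumulate gradients over microbatches with projection.

    microbatches: iterable of (policy_grad, seq_logprob_grads) pairs, where
    seq_logprob_grads holds one flattened gradient per sequence.
    Assumed order (not confirmed by the source): project the running
    gradient first, then add the current microbatch's policy gradient.
    """
    running = None
    for policy_grad, seq_grads in microbatches:
        if running is None:
            running = [0.0] * len(policy_grad)
        running = project_out(running, seq_grads)
        running = [r + p for r, p in zip(running, policy_grad)]
    return running


# Toy usage with 3-dimensional "gradients" and one sequence per microbatch.
toy = [
    ([1.0, 1.0, 0.0], [[1.0, 0.0, 0.0]]),
    ([0.0, 0.0, 1.0], [[0.0, 1.0, 0.0]]),
]
result = accumulate_projected(toy)
```

In this toy run, the contribution accumulated from the first microbatch loses its component along the second microbatch's log-probability gradient before the second policy gradient is added. A real implementation would operate on framework tensors and would likely avoid an explicit Gram-Schmidt loop.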

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis