Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

Zelal Su (Lain) Mustafaoglu; Sungyoung Lee; Eshan Balachandar; Risto Miikkulainen; Keshav Pingali

arXiv:2603.12596·cs.LG·March 16, 2026

Open Access

TL;DR

This paper introduces Consensus Aggregation for Policy Optimization (CAPO), a method that optimizes several PPO replicas on the same batch and aggregates them into a consensus policy, improving trust region compliance and performance without additional environment interactions.

Contribution

The paper proposes CAPO, an aggregation technique that redirects compute from optimization depth (more epochs) to width (more replicas), with theoretical guarantees and empirical improvements over PPO.

Findings

01. CAPO outperforms PPO and deeper baselines by up to 8.6x on control tasks.

02. Aggregation in natural parameter space achieves better trust region compliance.

03. Wider optimization with consensus improves sample efficiency and policy quality.

Abstract

Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural…
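The signal/waste decomposition and the Euclidean consensus step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It is a toy reading of the abstract under stated assumptions: the Fisher matrix `fisher`, the surrogate gradient `grad`, and the replica updates `deltas` are hypothetical random stand-ins, and the additive noise on each replica is an assumed proxy for the drift caused by different minibatch shuffling orders. With Fisher matrix $F$ and gradient $g$, signal is taken to be the $F$-orthogonal projection of an update onto the natural gradient $F^{-1} g$, and waste is the remainder.

```python
import numpy as np

def fisher_decompose(delta, grad, fisher):
    """Split an update into signal and waste under the Fisher inner product."""
    nat_grad = np.linalg.solve(fisher, grad)  # natural gradient direction F^{-1} g
    coef = (delta @ fisher @ nat_grad) / (nat_grad @ fisher @ nat_grad)
    signal = coef * nat_grad                  # projection onto the natural gradient
    waste = delta - signal                    # Fisher-orthogonal residual
    return signal, waste

def kl_cost(step, fisher):
    """Second-order estimate of the KL budget a step consumes: 0.5 * step^T F step."""
    return 0.5 * step @ fisher @ step

rng = np.random.default_rng(0)
dim, K = 8, 4  # hypothetical parameter dimension and number of replicas

# Hypothetical symmetric positive-definite Fisher matrix and surrogate gradient.
A = rng.normal(size=(dim, dim))
fisher = A @ A.T + dim * np.eye(dim)
grad = rng.normal(size=dim)

# Assumed replica updates: a shared natural-gradient step plus independent
# noise standing in for shuffle-order drift.
nat_grad = np.linalg.solve(fisher, grad)
deltas = [0.1 * nat_grad + 0.05 * rng.normal(size=dim) for _ in range(K)]

# Euclidean consensus: average the K replica updates.
consensus = np.mean(deltas, axis=0)

signal_c, waste_c = fisher_decompose(consensus, grad, fisher)
replica_waste = [kl_cost(fisher_decompose(d, grad, fisher)[1], fisher) for d in deltas]
consensus_waste = kl_cost(waste_c, fisher)

print("mean replica waste (KL estimate):", np.mean(replica_waste))
print("consensus waste (KL estimate):   ", consensus_waste)
print("first-order gain of waste (g^T w):", waste_c @ grad)
```

Because the waste component is a linear function of the update and the quadratic KL estimate is convex, the waste of the averaged update can never exceed the average waste of the replicas in this toy model, while the signal component of the average equals the average of the signals. The waste term also has zero inner product with the gradient, matching the abstract's description of a residual that consumes trust region budget without first-order surrogate improvement.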

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Advanced Neural Network Applications