Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization
Zelal Su (Lain) Mustafaoglu, Sungyoung Lee, Eshan Balachandar, Risto Miikkulainen, Keshav Pingali

TL;DR
This paper introduces Consensus Aggregation for Policy Optimization (CAPO), a method that improves policy training by aggregating multiple PPO replicas to enhance trust region compliance and performance without increasing environment interactions.
Contribution
The paper proposes CAPO, a novel aggregation technique that shifts focus from deeper to wider policy optimization, with theoretical guarantees and empirical improvements over PPO.
Findings
CAPO outperforms PPO and deeper baselines by up to 8.6x on control tasks.
Aggregation in natural parameter space achieves better trust region compliance.
Wider optimization with consensus improves sample efficiency and policy quality.
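To make the difference between the two aggregation spaces concrete, here is a minimal sketch of averaging in natural parameter space. It rests on assumptions the summary does not confirm: each replica outputs a diagonal Gaussian action distribution at a given state, and the consensus is the unweighted mean of the Gaussian natural parameters (precision times mean, and minus half the precision). The function name `average_natural_params` is hypothetical, and the paper's actual procedure may differ.

```python
import numpy as np

def average_natural_params(means, stds):
    """Average K diagonal Gaussians in natural parameter space.

    means, stds: arrays of shape (K, action_dim), one row per replica.
    Assumption (not from the paper): natural parameters are
    eta1 = mean / std**2 and eta2 = -1 / (2 * std**2).
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    precisions = 1.0 / stds**2

    # Unweighted mean of each natural parameter across replicas.
    eta1 = (precisions * means).mean(axis=0)
    eta2 = (-0.5 * precisions).mean(axis=0)

    # Map the averaged natural parameters back to mean and std.
    consensus_precision = -2.0 * eta2
    consensus_mean = eta1 / consensus_precision
    consensus_std = 1.0 / np.sqrt(consensus_precision)
    return consensus_mean, consensus_std

# Two hypothetical replicas that disagree; the second is more confident.
replica_means = np.array([[0.0], [3.0]])
replica_stds = np.array([[1.0], [0.5]])

consensus_mean, consensus_std = average_natural_params(replica_means, replica_stds)
plain_mean = replica_means.mean(axis=0)  # simple average of the means, for comparison
print(consensus_mean, consensus_std, plain_mean)
```

In this toy case the natural-parameter consensus mean sits closer to the more confident replica than the plain average of the means does, because each replica's mean is weighted by its precision.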
Abstract
Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural…
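The signal and waste decomposition described in the abstract can be sketched numerically. This is an illustrative reading, not the authors' code: it assumes a small, explicit Fisher matrix `fisher`, a surrogate gradient `grad`, and a parameter update `delta`, all randomly generated here, and it takes "signal" to be the projection of `delta` onto the natural gradient direction under the Fisher inner product. The helper names `fisher_decompose` and `fisher_sq_norm` are hypothetical.

```python
import numpy as np

def fisher_decompose(delta, grad, fisher):
    """Split an update into signal and waste under the Fisher metric.

    signal: projection of delta onto the natural gradient F^{-1} g,
            using the inner product <a, b>_F = a^T F b.
    waste:  the Fisher-orthogonal residual delta - signal.
    """
    nat_grad = np.linalg.solve(fisher, grad)
    coeff = (delta @ fisher @ nat_grad) / (nat_grad @ fisher @ nat_grad)
    signal = coeff * nat_grad
    waste = delta - signal
    return signal, waste

def fisher_sq_norm(v, fisher):
    """Squared Fisher norm v^T F v (twice the local quadratic KL estimate)."""
    return float(v @ fisher @ v)

# Toy problem with a random symmetric positive definite Fisher matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
fisher = A @ A.T + 4.0 * np.eye(4)
grad = rng.normal(size=4)
delta = rng.normal(size=4)

signal, waste = fisher_decompose(delta, grad, fisher)

# Waste contributes nothing to the first-order surrogate change grad^T delta ...
print("first-order gain from waste:", waste @ grad)
# ... yet it still uses part of the quadratic trust region budget.
print("budget used by signal:", fisher_sq_norm(signal, fisher))
print("budget used by waste: ", fisher_sq_norm(waste, fisher))
```

Because the two components are Fisher-orthogonal, their squared Fisher norms add up to that of the full update, which is why any growth in waste consumes trust region budget that the signal could otherwise have used.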
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Advanced Neural Network Applications
