Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Naoki Shitanda; Motoki Omura; Tatsuya Harada; Takayuki Osa

arXiv:2603.01741·cs.LG·March 4, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces Coupled Policy Optimization, a method that regulates diversity among policies in ensemble policy gradient algorithms to improve exploration, stability, and sample efficiency in large-scale reinforcement learning tasks.

Contribution

It provides a theoretical analysis of policy diversity effects and proposes a novel KL constraint-based regulation method that enhances exploration and learning stability.

Findings

01

Outperforms baselines like SAPG, PBT, and PPO in various tasks

02

Demonstrates structured exploration with policies distributing around a leader

03

Shows that regulated diversity improves sample efficiency and stability

Abstract

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in…
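The idea of regulating diversity through a KL term between policies can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes diagonal Gaussian policies, a KL direction from follower to leader, and a simple additive penalty weighted by a coefficient `lam`. The function names `diag_gaussian_kl` and `kl_regularized_loss` are hypothetical and introduced only for illustration.

```python
import math


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) between two diagonal Gaussians.

    Each argument is a sequence of per-dimension means or standard deviations.
    """
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        # Per-dimension KL between N(mp, sp^2) and N(mq, sq^2).
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def kl_regularized_loss(policy_loss, kl_to_leader, lam):
    """Add a KL penalty to a follower's policy loss (assumed additive form)."""
    return policy_loss + lam * kl_to_leader


# Hypothetical action distributions at a single state.
leader_mu, leader_std = [0.0, 0.0], [1.0, 1.0]
follower_mu, follower_std = [1.0, 0.0], [1.0, 1.0]

kl = diag_gaussian_kl(follower_mu, follower_std, leader_mu, leader_std)
loss = kl_regularized_loss(policy_loss=1.0, kl_to_leader=kl, lam=0.1)
```

In this sketch a larger `lam` pulls the follower closer to the leader, while `lam = 0` leaves the follower unconstrained. How the paper actually sets or schedules this weight is not stated in the text above.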

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* Introducing exploration incentives to the follower agents appears to be a novel and difficult task, since it risks destabilizing training that is already off-policy. The paper nevertheless appears to use the KL divergence and the discriminator in a way that promotes some exploration by the follower agents without training collapsing.
* The paper is well written and makes good use of the background and method sections. The results are well thought out.
* Although the main algorithmic contribution appe

Weaknesses

* I would have liked to know why a discriminator was chosen specifically over other exploration-based algorithms. In particular, there are other methods that do not require additional external training or the use of a function approximator [1].
* Going further on the second point, although the effects of the KL divergence were explained, Section 5.2 appears to be the only discussion of the use of an exploration algorithm, and it appears that there is little explanation on how the a

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper's insight is presented very clearly and the logic is rigorous; the reader can easily follow the authors' train of thought to understand the method. The writing quality is high.
2. The theoretical derivations are thorough, and the experimental validation is comprehensive. The visualization of KL divergence changes in Figure 4 is very valuable.
3. The method achieves a significant breakthrough on a high-difficulty task (Two-Arms Reorientation).

Weaknesses

1. The model's generalizability appears limited. It is only effective in specific environments, such as AllegroHand, where follower policies are prone to significant divergence (resulting in high variance). In contrast, on other tasks like Regrasping, the performance improvement is not as pronounced.
2. The training cost is somewhat high. The paper's CPO method requires more backpropagation components (roughly 12 vs. 7 for SAPG) and more wall-clock training time per iteration (approximately 25

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The paper is well-structured and clearly written, ensuring good readability.
2. The authors rethink the existing SAPG method by providing theoretical analysis and key insights, which effectively motivate the design of the proposed CPO.
3. The experimental results are robust and supported by sufficient evidence, and the overall writing flow is smooth.

Weaknesses

1. More concrete examples could be added to illustrate and validate the key theoretical insights, which would strengthen the persuasiveness of the work.
2. The design of the proposed CPO is relatively straightforward, as it only uses KL divergence to constrain the distance between follower and leader policies, lacking further optimization or innovative adjustments.
   - 2.1 The selection of the lambda hyperparameter in practice is somewhat heuristic, with no clear justification provided for its c


Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning