Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
Naoki Shitanda, Motoki Omura, Tatsuya Harada, Takayuki Osa

TL;DR
This paper introduces Coupled Policy Optimization, a method that regulates diversity among policies in ensemble policy gradient algorithms to improve exploration, stability, and sample efficiency in large-scale reinforcement learning tasks.
Contribution
It provides a theoretical analysis of policy diversity effects and proposes a novel KL constraint-based regulation method that enhances exploration and learning stability.
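As a rough illustration of what a KL-based diversity penalty can look like, the sketch below computes the KL divergence between a follower policy and a leader policy and adds it to a policy loss with a weight `lam`. This is not the authors' implementation: the diagonal-Gaussian policy form, the direction of the KL term, and the names `diag_gaussian_kl`, `kl_regularized_loss`, and `lam` are all assumptions made for the example.

```python
import math


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) for two diagonal Gaussians.

    Each argument is a list with one entry per action dimension.
    """
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def kl_regularized_loss(policy_loss, kl_to_leader, lam):
    """Hypothetical follower objective: policy loss plus a weighted KL penalty.

    A larger lam keeps the follower closer to the leader (less diversity);
    lam = 0 removes the coupling entirely.
    """
    return policy_loss + lam * kl_to_leader


# Toy action distributions (mean, std) for a 2-D action space; values are made up.
leader_mu, leader_std = [0.0, 0.0], [1.0, 1.0]
follower_mu, follower_std = [1.0, 0.0], [1.0, 1.0]

# Assumed direction: KL(follower || leader).
kl = diag_gaussian_kl(follower_mu, follower_std, leader_mu, leader_std)
loss = kl_regularized_loss(policy_loss=0.2, kl_to_leader=kl, lam=0.1)
print(f"KL = {kl:.3f}, regularized loss = {loss:.3f}")
```

In this toy setting the follower differs from the leader only by a unit shift in one mean, so the penalty is small; moving the follower further away increases the KL term and therefore the loss, which is the mechanism by which such a penalty would limit inter-policy divergence.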
Findings
Outperforms baselines like SAPG, PBT, and PPO in various tasks
Demonstrates structured exploration with policies distributing around a leader
Shows that regulated diversity improves sample efficiency and stability
Abstract
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization, which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in…
Peer Reviews
Decision: ICLR 2026 Poster
* It appears to be a novel and difficult task to introduce exploration incentives to the follower agents at the risk of destabilizing the already off-policy training; however, this paper appears to utilize the KL divergence and the discriminator in a way that promotes some exploration for the follower agents without training collapsing.
* The paper is well written, and utilizes the background and method section well. The results are well thought out.
* Although the main algorithmic contribution appe
* I would have liked to know why a discriminator was chosen specifically, in comparison to other exploration-based algorithms; in particular, there are other methods that do not require the additional external training or usage of a functional approximation [1].
* Going further on the second point, although the KL divergence effects were explained, section 5.2 appears to be the only discussion on the usage of an exploration algorithm, and it appears that there is little explanation on how the a
1. The paper's insight is presented very clearly, and the logic is rigorous; we can easily follow the author's train of thought and logic to understand the method. The paper's writing quality is high.
2. The theoretical derivations are thorough, and the experimental validation is comprehensive. The visualization of KL divergence changes in Figure 4 is very valuable.
3. The method achieves a significant breakthrough on a high-difficulty task (Two-Arms Reorientation).
1. The model's generalizability appears limited. It is only effective in specific environments, such as AllegroHand, where follower policies are prone to significant divergence (resulting in high variance). In contrast, on other tasks like Regrasping, the performance improvement is not as pronounced.
2. The training cost is somewhat high. The paper's CPO method requires more backpropagation components (roughly 12 vs. 7 for SAPG) and more wall-clock training time per iteration (approximately 25
1. The paper is well-structured and clearly written, ensuring good readability.
2. The authors rethink the existing SAPG method by providing theoretical analysis and key insights, which effectively motivate the design of the proposed CPO.
3. The experimental results are robust and supported by sufficient evidence, and the overall writing flow is smooth.
1. More concrete examples could be added to illustrate and validate the key theoretical insights, which would strengthen the persuasiveness of the work.
2. The design of the proposed CPO is relatively straightforward, as it only uses KL divergence to constrain the distance between follower and leader policies, lacking further optimization or innovative adjustments.
   - 2.1 The selection of the lambda hyperparameter in practice is somewhat heuristic, with no clear justification provided for its c
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
