The Curse of Diversity in Ensemble-Based Exploration
Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin, Aaron Courville

TL;DR
Training a diverse ensemble of data-sharing agents in deep reinforcement learning can impair the performance of individual members, because each member learns mostly from off-policy data generated by the others; representation learning methods such as CERL can help mitigate this.
Contribution
This paper identifies the 'curse of diversity' in ensemble-based exploration and proposes a novel representation learning approach, CERL, to address it.
Findings
Ensemble diversity can impair individual agent performance.
Larger replay buffers or smaller ensembles either fail to consistently mitigate the performance loss or undermine the advantages of ensembling.
Representation learning via CERL shows potential to counteract the curse.
Abstract
We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents -- a well-established exploration strategy -- can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency of the individual ensemble members to learn from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions -- such as a larger replay buffer or a smaller ensemble size -- either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to…
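The "low proportion of self-generated data" argument can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a Bootstrapped-DQN-style setup in which one ensemble member is picked uniformly at random to act for each episode, all episodes have the same length, and the shared replay buffer is unbounded. The function name `self_generated_fraction` and its parameters are hypothetical and introduced only for illustration. Under these assumptions each member generates roughly 1/N of the shared data, where N is the ensemble size.

```python
import random
from collections import Counter


def self_generated_fraction(ensemble_size, num_episodes=10_000, episode_len=100, seed=0):
    """Return, for each member, the share of the shared buffer it generated itself.

    Assumptions (not taken from the paper text): one member acts per episode,
    chosen uniformly at random; fixed episode length; unbounded shared buffer.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_episodes):
        actor = rng.randrange(ensemble_size)  # member that collects this episode
        counts[actor] += episode_len          # its transitions go into the shared buffer
    total = sum(counts.values())
    return [counts[member] / total for member in range(ensemble_size)]


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        fractions = self_generated_fraction(n)
        # Each member trains on the whole buffer, but only about 1/n of it is its own data.
        print(f"ensemble_size={n}: smallest self-generated share = {min(fractions):.3f}")
```

With a single agent the share is 1, and it shrinks as the ensemble grows, which is the sense in which the rest of the buffer is off-policy from each member's point of view.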
Peer Reviews
Decision: ICLR 2024 poster
Presentation - The paper is extremely clear and excellently written. The problem, motivation, and experiments are articulated very clearly. I like that they look introspectively at their own experiments and reason clearly about what can be inferred/concluded from their experiments without making unreasonable intellectual leaps. Contributions: 1. They show that perhaps the aggregation/majority voting aspect of ensembling methods may contribute to improved performance more than previously attribu
I do not have any major qualms with the paper, but I'll list a few thoughts. Perhaps this is out of scope of the paper, but I do feel it's difficult to draw conclusions about deep RL more broadly without investigating distributional RL. For example, this paper (https://openreview.net/pdf?id=ryeUg0VFwr) shows that distributional RL will likely do better with this off-policy data. It would be interesting to investigate the extent of this phenomenon in the distributional setting. CERL does s
The reviewer liked the paper a lot. The main hypothesis makes sense and is substantiated in multiple experiments that show the effect nicely. The paper is well written and the figures are clearly readable. More detailed figures for individual environments are provided in the appendix, which is welcome to get an idea how trustworthy the aggregate performance measures are. The proposed method is not terribly innovative, but to the best knowledge of this reviewer novel. The discussion on other repr
While the main paper is very well written and the experiments appear quite thorough, the reviewer took issue with the way that some conclusions were presented. In particular the connection to exploration (which is in the title) ignores some major alternative explanations of the results. While the reviewer recommends to accept the paper, some phrases *need* to be changed, and some discussion needs to be added, to prevent the casual reader from misinterpreting the text and results. These are: 1.
I like the way this paper is presented. It is clearly motivated by empirical discoveries, together with reasoning, and followed by solutions to the identified problems. The authors made great efforts in conducting and presenting experiments. Results are reported in a statistically sound way. I really appreciate it.
### On high-level Motivation: I’m lost in the motivation of using ensemble in **policy learning**. As has been demonstrated in [EDAC] and [REDQ], I acknowledge that using ensemble learning for the **value function** could lead to improved performance, as the value can be more accurate, with uncertainty. But what is the motivation for having **multiple policies** for ensemble (because they are sample generators, rather than learners). Should not those samplers aim at more efficiently decreasing
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
