Is Exploration or Optimization the Problem for Deep Reinforcement Learning?
Glen Berseth

TL;DR
This paper investigates whether the main challenge in deep reinforcement learning is exploration or optimization, introducing a practical sub-optimality estimator that reveals deep RL algorithms often underutilize generated experience.
Contribution
It presents a novel sub-optimality estimator to diagnose optimization limitations in deep RL and demonstrates that current methods exploit only half of the generated experience.
Findings
Deep RL methods only exploit about half of the experience they generate.
The best experience generated is 2-3 times better than the learned policy's performance.
Optimization difficulties significantly limit deep RL performance.
Abstract
In the era of deep reinforcement learning, making progress is more complex, as the collected experience must be compressed into a deep model for future exploitation and sampling. Many papers have shown that training a deep learning policy under a changing state and action distribution leads to sub-optimal performance, or even collapse. This naturally raises the concern that even if the community creates improved exploration algorithms or reward objectives, those improvements will fall on the deaf ears of optimization difficulties. This work proposes a new practical sub-optimality estimator to determine the optimization limitations of deep reinforcement learning algorithms. Through experiments across environments and RL algorithms, it is shown that the best experience generated is 2-3 times better than the policies' learned performance. This…
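The abstract does not define the estimator on this page, so the following is only a minimal sketch of the general idea, assuming the description given in one of the reviews below: the top 5% highest-return trajectories stand in for the best experience, and their mean return is compared with the mean return of the learned policy. The function name `suboptimality_gap`, its parameters, and the choice of a simple mean are assumptions for illustration, not the paper's definition.

```python
from statistics import mean


def suboptimality_gap(experience_returns, policy_returns, top_fraction=0.05):
    """Hypothetical sketch: gap between the best collected experience and the policy.

    experience_returns: episode returns of all trajectories collected in training.
    policy_returns: episode returns of the current learned policy.
    top_fraction: share of trajectories treated as "best experience"
                  (0.05 follows a reviewer's description; an assumption here).
    Returns (gap, best_mean, policy_mean).
    """
    if not experience_returns or not policy_returns:
        raise ValueError("both return lists must be non-empty")

    # Rank all collected episodes by return, best first.
    ranked = sorted(experience_returns, reverse=True)

    # Keep at least one episode so the estimate is always defined.
    k = max(1, round(top_fraction * len(ranked)))

    best_mean = mean(ranked[:k])
    policy_mean = mean(policy_returns)
    return best_mean - policy_mean, best_mean, policy_mean


if __name__ == "__main__":
    # Toy numbers only; not results from the paper.
    collected = list(range(1, 101))   # 100 episodes with returns 1..100
    evaluated = [40, 60]              # returns of the learned policy
    gap, best, learned = suboptimality_gap(collected, evaluated)
    print(gap, best, learned)
```

A large gap under this sketch would suggest that the agent has already generated better behavior than its policy reproduces. As the reviews note, in stochastic environments the top-return episodes may partly reflect luck, so a gap computed this way could overstate what is actually learnable.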
Peer Reviews
Decision · Submitted to ICLR 2026
- Clear and well-motivated research question. The paper addresses a fundamental question in RL (exploration vs. exploitation) that has often been debated but rarely quantified. The proposed estimator provides a concrete diagnostic tool to analyze this trade-off empirically.
- Broad experimental coverage. The authors test across diverse environments and both on-policy and off-policy algorithms (PPO and DQN). Figures 3–5 (pp. 6–8) show consistent trends across MinAtar, Atari, Montezuma’s Revenge, Ha
- The proposed metric compares best vs. average trajectories but does not causally separate exploration and optimization. For example, an algorithm’s “best experience” may depend heavily on stochastic exploration artifacts rather than a genuine ability to generate diverse high-value data.
- Limited algorithmic diversity. The experiments focus mainly on PPO and DQN, which, while standard, represent only a subset of deep RL paradigms. Missing are modern algorithms such as SAC, IQL, which emphasize
The paper is generally clear and well-structured. It is well-motivated and takes on an important, under-discussed question: when deep RL stalls, is the bottleneck exploration or exploitation? Bringing this issue to the forefront is valuable for both researchers and practitioners and, in my view, warrants attention regardless of whether one agrees with the specific estimator proposed.
* Soundness of the estimator: Using the top 5% highest-return trajectories as a proxy for the “experience-optimal policy” is problematic in stochastic environments. High-return episodes may result from lucky transitions or risky, low-expectation action sequences, making them non-reproducible and not necessarily exploitable by a learned policy.
* Lack of analysis on learnability: The paper does not examine whether these “good trajectories” correspond to behavior that is actually learnable or gene
- The studied problem, "is exploration or optimization (exploitation) the problem for (practical) deep RL", is an important and interesting problem.
- The experimental results in Section 5 are interesting.
Though this paper is interesting, I do not think the current version is ready for publication, for the following reasons:
- [major] The presentation and discussion in Section 4 are too handwavy. I recommend that the authors make a major revision of it to make it more rigorous. Most importantly, please provide a mathematically rigorous definition of the **experience optimal policy** $\hat{\pi}^*$. This is crucial, since the experience optimal policy is a key concept in this paper, and is used to
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Multi-Objective Optimization Algorithms
