Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
Brendan O'Donoghue

TL;DR
This paper introduces a differentiable optimistic objective for deep reinforcement learning that encourages efficient exploration by converting uncertainty into value through an epistemic-risk-seeking utility, with theoretical guarantees and practical algorithms.
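The phrase "converting uncertainty into value" can be made concrete with a small calculation. This is a minimal sketch under stated assumptions, not the paper's objective. It assumes an exponential (risk-seeking) utility with temperature tau > 0 and a Gaussian belief X ~ N(mu, sigma^2) over an action's value. Under those assumptions the certainty equivalent tau * log E[exp(X / tau)] equals mu + sigma^2 / (2 * tau), so uncertainty shows up as an additive bonus. The function names are hypothetical.

```python
import math
import random

def certainty_equivalent(samples, tau):
    """Monte Carlo estimate of tau * log E[exp(X / tau)] from samples of X.

    Assumes an exponential utility with temperature tau (illustrative choice).
    Shifts by the maximum (log-sum-exp trick) so exp() cannot overflow.
    """
    scaled = [x / tau for x in samples]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return tau * (shift + math.log(mean_exp))

def gaussian_certainty_equivalent(mu, sigma, tau):
    """Closed form for X ~ N(mu, sigma^2): the mean plus an uncertainty bonus."""
    return mu + sigma ** 2 / (2 * tau)

if __name__ == "__main__":
    random.seed(0)
    tau = 1.0
    for sigma in (0.1, 1.0):
        # Two beliefs with the same mean but different epistemic uncertainty.
        xs = [random.gauss(1.0, sigma) for _ in range(100_000)]
        print(sigma, certainty_equivalent(xs, tau),
              gaussian_certainty_equivalent(1.0, sigma, tau))
```

Both beliefs have mean 1.0. With tau = 1.0 the closed form gives 1.005 for sigma = 0.1 and 1.5 for sigma = 1.0, so the more uncertain option is valued higher. As tau grows, the bonus sigma^2 / (2 * tau) shrinks toward zero and the agent becomes risk-neutral.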
Contribution
It proposes a new risk-seeking, utility-based objective for deep RL, along with a model-free algorithm, epistemic-risk-seeking actor-critic (ERSAC), that provably explores efficiently under function approximation.
Findings
The ERSAC algorithm outperforms existing exploration methods on the DeepSea environment.
Combining the risk-seeking objective with replay data improves statistical efficiency.
The method also improves performance on the Atari benchmark.
Abstract
Exploration remains a key challenge in deep reinforcement learning (RL). Optimism in the face of uncertainty is a well-known heuristic with theoretical guarantees in the tabular setting, but how best to translate the principle to deep reinforcement learning, which involves online stochastic gradients and deep network function approximators, is not fully understood. In this paper we propose a new, differentiable optimistic objective that, when optimized, yields a policy that provably explores efficiently, with guarantees even under function approximation. Our new objective is a zero-sum two-player game derived from endowing the agent with an epistemic-risk-seeking utility function, which converts uncertainty into value and encourages the agent to explore uncertain states. We show that the solution to this game minimizes an upper bound on the regret, with the 'players' each attempting to…
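A zero-sum game of this kind can be solved with gradients by letting one player ascend and the other descend. The sketch below is a hypothetical toy, not ERSAC or the paper's game. It assumes a two-armed problem with Gaussian beliefs (means mu, standard deviations sigma). It also assumes a made-up objective L(pi, tau) = sum_a pi_a * (mu_a + sigma_a^2 / (2 * tau)) + tau * H(pi). A policy player maximizes L over softmax logits, a temperature player minimizes it over rho = log tau, and both update simultaneously. The names `softmax`, `entropy`, `gradients`, and `gda` are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(pi):
    return -sum(p * math.log(p) for p in pi if p > 0)

def gradients(logits, rho, mu, sigma):
    """Gradients of the toy objective
    L(pi, tau) = sum_a pi_a (mu_a + sigma_a^2 / (2 tau)) + tau * H(pi),
    with pi = softmax(logits) and tau = exp(rho). Returns (dL/dlogits, dL/drho).
    """
    tau = math.exp(rho)
    pi = softmax(logits)
    # dL/dpi_a, dropping the -tau constant shared by all arms (it cancels below).
    g = [m + s * s / (2 * tau) - tau * math.log(p) for m, s, p in zip(mu, sigma, pi)]
    mean_g = sum(p * ga for p, ga in zip(pi, g))
    # Softmax chain rule: dL/dlogit_a = pi_a * (g_a - sum_b pi_b g_b).
    d_logits = [p * (ga - mean_g) for p, ga in zip(pi, g)]
    var = sum(p * s * s for p, s in zip(pi, sigma))
    # Chain rule through tau = exp(rho): dL/drho = tau * dL/dtau.
    d_rho = tau * (-var / (2 * tau * tau) + entropy(pi))
    return d_logits, d_rho

def gda(mu, sigma, lr=0.05, steps=20_000):
    """Simultaneous gradient ascent (policy logits) and descent (log-temperature)."""
    logits = [0.0] * len(mu)
    rho = 0.0
    for _ in range(steps):
        d_logits, d_rho = gradients(logits, rho, mu, sigma)
        # Both players step from the same iterate (simultaneous, not alternating).
        logits = [z + lr * d for z, d in zip(logits, d_logits)]
        rho = rho - lr * d_rho
    return softmax(logits), math.exp(rho)

if __name__ == "__main__":
    pi, tau = gda([1.0, 0.8], [0.1, 0.5])
    print(pi, tau)
```

With means (1.0, 0.8) and standard deviations (0.1, 0.5), the toy's saddle point puts more probability on the second arm despite its lower mean, because its larger uncertainty earns a bonus. Optimizing over rho = log tau keeps the temperature positive without an explicit constraint.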
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
