Near-Optimal Randomized Exploration for Tabular Markov Decision Processes
Zhihan Xiong, Ruoqi Shen, Qiwen Cui, Maryam Fazel, Simon S. Du

TL;DR
This paper shows that randomized exploration algorithms using a single random seed per episode and a Bernstein-type noise magnitude achieve a near-optimal regret bound in episodic time-inhomogeneous Markov Decision Processes, matching the lower bound up to logarithmic factors.
Contribution
It introduces a new analysis and techniques showing that algorithms based on randomized value functions can be near-optimal, a guarantee previously achieved only by optimistic methods.
Findings
Achieves an $\tilde{O}(H\sqrt{SAT})$ regret bound matching lower bounds.
Develops a new clipping operation for better optimism and pessimism control.
Introduces a recursive formula for analyzing estimation error.
Abstract
We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\tilde{O}(H\sqrt{SAT})$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the…
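The two ingredients named in the abstract, a single random seed per episode and a Bernstein-type noise magnitude, can be illustrated with a minimal sketch of perturbed backward induction. This is not the paper's algorithm: the function name `perturbed_backward_induction`, the parameter `sigma`, the exact form of the noise scale, and the plain clip to `[0, H - h]` are all assumptions made for illustration. In particular, the paper's own clipping operation is not reproduced here.

```python
import numpy as np


def perturbed_backward_induction(P_hat, R_hat, counts, H, rng, sigma=1.0):
    """Plan for one episode with a randomized value function (illustrative only).

    P_hat:  (H, S, A, S) empirical transition probabilities
    R_hat:  (H, S, A)    empirical mean rewards, assumed to lie in [0, 1]
    counts: (H, S, A)    visit counts
    sigma:  hypothetical scalar controlling the overall noise level
    """
    _, S, A, _ = P_hat.shape
    # One Gaussian draw shared by every (h, s, a): the "single random seed".
    xi = rng.standard_normal()
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        mean_next = P_hat[h] @ V[h + 1]                        # shape (S, A)
        var_next = P_hat[h] @ (V[h + 1] ** 2) - mean_next ** 2
        n = np.maximum(counts[h], 1)
        # Bernstein-type magnitude (assumed form): an empirical-variance term
        # plus a lower-order H / n term.
        scale = sigma * (np.sqrt(np.maximum(var_next, 0.0) / n) + H / n)
        # Simple clip to the feasible value range; a stand-in, not the
        # clipping operation developed in the paper.
        Q[h] = np.clip(R_hat[h] + mean_next + xi * scale, 0.0, H - h)
        V[h] = Q[h].max(axis=1)
    return Q, V
```

Acting greedily with respect to the returned `Q` for one episode, then updating `P_hat`, `R_hat` and `counts`, gives the usual episodic loop. Setting `sigma=0.0` removes the perturbation and reduces the routine to ordinary value iteration on the empirical model.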
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
