Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Zhihan Xiong; Ruoqi Shen; Qiwen Cui; Maryam Fazel; Simon S. Du

arXiv:2102.09703·cs.LG·October 14, 2022·1 cites

TL;DR

This paper demonstrates that randomized exploration algorithms using a single random seed per episode and Bernstein-type noise can achieve near-optimal regret in episodic Markov Decision Processes, matching the lower bound up to logarithmic factors.

Contribution

It introduces new analysis techniques showing that algorithms based on randomized value functions can be near-optimal, a guarantee previously achieved only by optimistic methods.

Findings

01

Achieves an $\tilde{O}(H\sqrt{SAT})$ regret bound, matching the lower bound up to logarithmic factors.

02

Develops a new clipping operation for better optimism and pessimism control.

03

Introduces a recursive formula for analyzing estimation error.

Abstract

We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\tilde{O}(H\sqrt{SAT})$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions and, for the first time, matches the $\Omega(H\sqrt{SAT})$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the…
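The two ingredients named in the abstract (one random seed shared across an episode, and a Bernstein-type noise magnitude) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `randomized_value_iteration`, the `noise_scale` parameter, the exact form of the noise magnitude, and the simple range clip are all assumptions chosen for illustration, and the paper's constants, logarithmic factors, and clipping operation are not reproduced here.

```python
import numpy as np


def randomized_value_iteration(counts, reward_hat, trans_hat, H, rng, noise_scale=1.0):
    """One backward planning pass with a single shared Gaussian seed (illustrative).

    counts:     (H, S, A) visit counts n_h(s, a)
    reward_hat: (H, S, A) empirical mean rewards, assumed to lie in [0, 1]
    trans_hat:  (H, S, A, S) empirical transition probabilities
    """
    _, S, A = counts.shape
    # Single random seed: one Gaussian draw reused for every (h, s, a) this episode.
    z = rng.standard_normal()
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        n = np.maximum(counts[h], 1)
        mean_next = trans_hat[h] @ V[h + 1]  # expected next value, shape (S, A)
        var_next = trans_hat[h] @ (V[h + 1] ** 2) - mean_next ** 2
        var_next = np.maximum(var_next, 0.0)  # guard against tiny negative values
        # Bernstein-style magnitude (assumed form): a variance-dependent term
        # plus a lower-order 1/n term.
        sigma = noise_scale * (np.sqrt(var_next / n) + (H - h) / n)
        Q[h] = reward_hat[h] + mean_next + z * sigma
        # Plain range clip to [0, H - h]; a stand-in, not the paper's clipping rule.
        Q[h] = np.clip(Q[h], 0.0, H - h)
        V[h] = Q[h].max(axis=1)
    return Q, V
```

In an episodic loop, an agent would call this once per episode with updated counts and estimates, then act greedily with respect to the returned `Q`. Sharing one draw `z` across all state-action pairs is what distinguishes the single-seed variant from schemes that perturb each entry independently.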

Videos

Near-Optimal Randomized Exploration for Tabular Markov Decision Processes · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning