Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Zihan Zhang; Yuhang Jiang; Yuan Zhou; Xiangyang Ji

arXiv:2210.08238 · cs.LG · October 18, 2022 · 1 citation

TL;DR

This paper introduces a near-optimal algorithm for multi-batch reinforcement learning in finite-horizon MDPs, achieving $\tilde{O}(\sqrt{SAH^3K})$ regret with only $O(H + \log\log K)$ policy-update batches, and establishes lower bounds on batch complexity.

Contribution

It presents the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H + \log\log K)$ batch complexity and provides matching lower bounds on the number of batches needed.

Findings

01

Achieves $\tilde{O}(\sqrt{SAH^3K})$ regret with $O(H + \log\log K)$ batches.

02

Establishes lower bounds on batch complexity for near-optimal regret.

03

Introduces efficient exploration strategies for unlearned states and transition models.
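
To get a feel for the scale of the quantities in the first finding, the following minimal sketch evaluates the leading terms $H + \log_2\log_2 K$ and $\sqrt{SAH^3K}$ for given problem sizes. The function names `batch_budget` and `regret_rate` are hypothetical, and the constants and logarithmic factors hidden by $O(\cdot)$ and $\tilde{O}(\cdot)$ are ignored, so the numbers are only order-of-magnitude indicators, not values reported by the paper.

```python
import math


def batch_budget(H: int, K: int) -> float:
    """Leading term of the batch count, H + log2(log2(K)).

    Hidden constants in O(.) are ignored (an assumption of this sketch).
    """
    if K < 2:
        raise ValueError("K must be at least 2 so that log2(log2(K)) is defined")
    return H + math.log2(math.log2(K))


def regret_rate(S: int, A: int, H: int, K: int) -> float:
    """Leading term of the regret, sqrt(S * A * H^3 * K).

    Logarithmic factors hidden by O-tilde are ignored.
    """
    return math.sqrt(S * A * H ** 3 * K)


if __name__ == "__main__":
    # Illustrative problem sizes chosen for this example only.
    S, A, H, K = 20, 5, 10, 65536
    print("batch budget (leading term):", batch_budget(H, K))
    print("regret rate (leading term):", regret_rate(S, A, H, K))
```

With $K = 65536$ the term $\log_2\log_2 K$ equals 4, so for moderate horizons the batch budget is dominated by $H$ rather than by the number of episodes.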

Abstract

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must provide a schedule of policy updates before learning begins, which makes the framework particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^{3}K\ln(1/\delta)})$ (where $\tilde{O}(\cdot)$ hides logarithmic terms in $S, A, H, K$) in $K$ episodes using $O(H + \log_{2}\log_{2}(K))$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^{3}K})$ regret bound with…
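
The abstract describes an agent that fixes its policy-update schedule in advance. One way to see how a $\log_2\log_2 K$ batch count can arise is the grid $t_i = \lceil K^{1 - 2^{-i}} \rceil$, a construction common in the batched bandit setting. The sketch below builds such a grid. It is an illustration under that assumption, not the schedule used by the paper's algorithm, and it does not model the additive $H$ term. The function name `doubly_exponential_grid` is hypothetical.

```python
import math
from typing import List


def doubly_exponential_grid(K: int) -> List[int]:
    """Return increasing batch end points in [1, K], ending at K.

    Batch i ends at roughly K ** (1 - 2 ** -i). After about
    log2(log2(K)) batches the end point reaches K / 2, so one
    final batch covers the remaining episodes.
    """
    if K < 1:
        raise ValueError("K must be a positive integer")
    if K <= 2:
        return [K]
    num_levels = max(1, math.ceil(math.log2(math.log2(K))))
    grid: List[int] = []
    for i in range(1, num_levels + 1):
        # Small tolerance guards against floating-point overshoot before ceil.
        t = min(K, math.ceil(K ** (1.0 - 2.0 ** (-i)) - 1e-9))
        if not grid or t > grid[-1]:
            grid.append(t)
    if grid[-1] < K:
        grid.append(K)  # final batch runs to the last episode
    return grid


if __name__ == "__main__":
    for K in (100, 10_000, 1_000_000):
        schedule = doubly_exponential_grid(K)
        print(K, len(schedule), schedule)
```

The schedule depends only on $K$, so it can be computed before any episode is played, which is the defining requirement of the multi-batch setting described above.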

Videos

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Supply Chain and Inventory Management