Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning
Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

TL;DR
This paper introduces a near-optimal algorithm for multi-batch reinforcement learning in finite-horizon MDPs, achieving low regret with minimal batch updates, and establishes lower bounds on batch complexity.
Contribution
It presents the first near-optimal regret bound with $O(H+\log\log K)$ batch complexity and provides matching lower bounds on the number of batches needed.
Findings
Achieves $\tilde{O}(\sqrt{SAH^3K})$ regret with $O(H+\log\log K)$ batches.
Establishes lower bounds on batch complexity for near-optimal regret.
Introduces efficient exploration strategies for unlearned states and transition models.
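To give some intuition for where a $\log\log K$ batch count can come from, the sketch below builds the geometric-in-the-exponent grid $t_i \approx K^{1-2^{-i}}$ that is common in the batched bandit literature. This is an illustrative assumption, not the schedule confirmed by this summary: the function name `batch_grid` and the stopping rule are hypothetical, and the sketch does not model the additive $H$ term in the batch bound.

```python
import math


def batch_grid(K: int) -> list[int]:
    """Return end-of-batch episode indices t_1 < ... < t_M = K.

    Hypothetical schedule: t_i = ceil(K ** (1 - 2 ** -i)). It is fixed
    before learning starts, as the multi-batch setting requires.
    """
    if K < 1:
        raise ValueError("K must be a positive integer")
    ends = []
    i = 1
    while True:
        t = math.ceil(K ** (1.0 - 2.0 ** (-i)))
        # Once the remaining episodes are within a factor of 2 of t,
        # finish with a single last batch that runs up to episode K.
        if K <= 2 * t:
            break
        ends.append(t)
        i += 1
    ends.append(K)
    return ends


if __name__ == "__main__":
    for K in (10, 1_000, 1_000_000):
        grid = batch_grid(K)
        print(K, len(grid), grid)
```

Each grid point squares the "distance" to $K$ in the exponent, so the loop stops after roughly $\log_2\log_2 K$ iterations; even for very large $K$ the policy is updated only a handful of times.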
Abstract
In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to provide a time schedule for policy updates before the learning process begins, which is particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K})$ (where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$) in $K$ episodes using $O(H+\log\log K)$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first regret bound with…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Supply Chain and Inventory Management
