REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs
Peter L. Bartlett, Ambuj Tewari

TL;DR
This paper introduces REGAL, an algorithm that achieves the optimal regret rate in weakly communicating MDPs by regularizing with the span of the optimal bias vector, improving on previous regret bounds.
Contribution
The paper presents a novel regularization-based algorithm for reinforcement learning in weakly communicating MDPs with proven optimal regret bounds.
Findings
Achieves regret of $\tilde{O}(HS\sqrt{AT})$ in weakly communicating MDPs with S states, A actions, and optimal bias span bounded by H.
Relates the span of the optimal bias vector to diameter-like quantities of the MDP, showing how the result improves on previous regret bounds.
Demonstrates the effectiveness of span-based regularization in RL algorithms.
Abstract
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
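As a rough illustration of the quantities in the abstract, the sketch below computes the span of a vector (its largest entry minus its smallest) and the leading-order term H·S·√(A·T) of the stated regret bound, with constants and logarithmic factors dropped. The bias vector, the function names `span` and `regret_bound_order`, and the values of A and T are hypothetical and chosen only for illustration; this is not the REGAL algorithm itself.

```python
import math


def span(h):
    """Span of a vector: largest entry minus smallest entry."""
    return max(h) - min(h)


def regret_bound_order(H, S, A, T):
    """Leading-order term H * S * sqrt(A * T).

    Constants and logarithmic factors hidden by the tilde-O
    notation are deliberately ignored here.
    """
    return H * S * math.sqrt(A * T)


# Hypothetical bias vector for a 4-state MDP (illustrative values only).
bias = [0.0, 1.5, -0.5, 2.0]

H = span(bias)  # upper bound on the span used in the regret bound
order = regret_bound_order(H, S=len(bias), A=2, T=10_000)

print(f"span H = {H}")
print(f"H * S * sqrt(A * T) = {order:.1f}")
```

The bound grows linearly in H and S but only as the square root of A and T, so a fourfold increase in the horizon T roughly doubles this term.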
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
