REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

Peter L. Bartlett; Ambuj Tewari

arXiv:1205.2661·cs.LG·May 14, 2012·142 cites

TL;DR

This paper introduces REGAL, an algorithm that achieves the optimal regret rate in unknown weakly communicating MDPs by selecting policies with regularization based on the span of the optimal bias vector, improving on previous regret bounds.

Contribution

The paper presents a novel regularization-based algorithm for reinforcement learning in weakly communicating MDPs with proven optimal regret bounds.

Findings

01

Achieves regret of $\tilde{O}(HS\sqrt{AT})$ in weakly communicating MDPs.

02

Relates span of bias vector to diameter-like quantities, improving regret bounds.

03

Demonstrates the effectiveness of span-based regularization in RL algorithms.

Abstract

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
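To make the idea of span-based regularization concrete, here is a minimal Python sketch. It assumes each candidate model is already summarized by its optimal gain and optimal bias vector, and it scores candidates by gain minus a penalty times the span of the bias vector. The names `span`, `pick_regularized`, and `penalty` are hypothetical, and the abstract does not specify how the paper builds its candidate set or sets the penalty, so this illustrates the general idea rather than the authors' exact procedure.

```python
from typing import List, Sequence, Tuple


def span(h: Sequence[float]) -> float:
    """Span of a vector: largest entry minus smallest entry."""
    return max(h) - min(h)


def pick_regularized(
    candidates: List[Tuple[float, Sequence[float]]],
    penalty: float,
) -> int:
    """Return the index of the candidate with the best regularized score.

    Each candidate is a (gain, bias_vector) pair (a hypothetical summary
    of one plausible MDP). The score is gain - penalty * span(bias_vector),
    so candidates whose bias vector has a large span are penalized.
    """
    scores = [gain - penalty * span(bias) for gain, bias in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Two made-up candidates: the first has a higher gain but a much larger span.
candidates = [
    (0.9, [0.0, 10.0]),
    (0.8, [0.0, 1.0]),
]
choice_without_penalty = pick_regularized(candidates, penalty=0.0)
choice_with_penalty = pick_regularized(candidates, penalty=0.05)
```

With no penalty the higher-gain candidate is chosen; with a positive penalty the choice shifts to the candidate whose bias vector has the smaller span.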

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization