Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Ronald Ortner

arXiv:1808.01813·cs.LG·January 23, 2019

TL;DR

This paper introduces a simple optimistic algorithm for reinforcement learning in uniformly ergodic Markov decision processes. In the general, non-episodic setting, its regret bounds have near-optimal dependence on the mixing time, the number of states, the number of actions, and the number of steps.

Contribution

It provides the first regret bounds in the general, non-episodic setting with optimal dependence on all given parameters, derived using Markov chain concentration techniques.

Findings

01

Regret bounds of $\tilde{O}(\sqrt{t_{mix} S A T})$ after $T$ steps, where $S$ is the number of states, $A$ the number of actions, and $t_{mix}$ the mixing time parameter.

02

Applicable to uniformly ergodic Markov decision processes.

03

First regret bounds with optimal parameter dependence in this setting.

Abstract

We give a simple optimistic algorithm for which it is easy to derive regret bounds of $\tilde{O}(\sqrt{t_{mix} S A T})$ after $T$ steps in uniformly ergodic Markov decision processes with $S$ states, $A$ actions, and mixing time parameter $t_{mix}$. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
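
To give a feel for the mixing time parameter $t_{mix}$ that enters the bound, here is a minimal sketch that computes a mixing time for a small ergodic Markov chain. It assumes one common textbook definition: the smallest $t$ at which the worst-case total variation distance to the stationary distribution drops to at most 1/4. The paper's exact definition of $t_{mix}$ may differ. The two-state chain `P` and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: mixing time of a small ergodic Markov chain.
# The chain, the 1/4 threshold, and all function names are assumptions
# chosen for illustration; they are not taken from the paper.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def stationary_distribution(p, iters=10_000):
    """Approximate the stationary distribution by power iteration."""
    n = len(p)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * p[i][j] for i in range(n)) for j in range(n)]
    return mu

def total_variation(x, y):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))

def mixing_time(p, eps=0.25, max_steps=1_000):
    """Smallest t with max_s TV(P^t(s, .), mu) <= eps, or None if not reached."""
    mu = stationary_distribution(p)
    p_t = p  # holds P^t, starting at t = 1
    for t in range(1, max_steps + 1):
        if max(total_variation(row, mu) for row in p_t) <= eps:
            return t
        p_t = mat_mul(p_t, p)
    return None

# Hypothetical two-state chain: rows are current states, columns next states.
P = [[0.9, 0.1],
     [0.2, 0.8]]

print(mixing_time(P))  # 3
```

For this chain the second eigenvalue is 0.7 and the worst-case distance after $t$ steps is $(2/3) \cdot 0.7^t$, which first falls below 1/4 at $t = 3$. Chains that switch states more rarely mix more slowly, which is the sense in which a larger $t_{mix}$ makes learning harder.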
