Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Ronald Ortner

TL;DR
This paper introduces an optimistic algorithm for reinforcement learning in Markov decision processes. In the non-episodic setting, it achieves near-optimal regret bounds that depend on the mixing time, the number of states, the number of actions, and the number of steps.
Contribution
It provides the first regret bounds in the non-episodic setting with optimal dependence on key parameters using Markov chain concentration techniques.
Findings
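Regret bounds of \(\tilde{O}(\sqrt{t_{\mathrm{mix}} S A T})\), where \(t_{\mathrm{mix}}\) is the mixing time, \(S\) the number of states, \(A\) the number of actions, and \(T\) the number of steps.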
Applicable to uniformly ergodic Markov decision processes.
First regret bounds with optimal parameter dependence in this setting.
Abstract
We give a simple optimistic algorithm for which it is easy to derive regret bounds of \(\tilde{O}(\sqrt{t_{\mathrm{mix}} S A T})\) after \(T\) steps in uniformly ergodic Markov decision processes with \(S\) states, \(A\) actions, and mixing time parameter \(t_{\mathrm{mix}}\). These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
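To give a feel for how a bound of this shape scales, the following minimal sketch evaluates only the leading term, the square root of (mixing time × states × actions × steps). It is an illustration under stated assumptions, not the paper's algorithm: the function name `regret_bound_leading_term` and the parameter values are hypothetical, and the constants and logarithmic factors hidden by the tilde-O notation are omitted.

```python
import math


def regret_bound_leading_term(t_mix: float, S: int, A: int, T: int) -> float:
    """Leading term sqrt(t_mix * S * A * T) of the stated regret bound.

    Constants and logarithmic factors hidden by the tilde-O notation are
    deliberately left out, so this is a scaling illustration only.
    """
    return math.sqrt(t_mix * S * A * T)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    t_mix, S, A = 10, 20, 5
    for T in (10**4, 10**6, 10**8):
        bound = regret_bound_leading_term(t_mix, S, A, T)
        # The per-step value bound / T shrinks as T grows (sublinear regret).
        print(f"T={T:>9}  leading term={bound:.1f}  per step={bound / T:.5f}")
```

Because the leading term grows with the square root of the number of steps, quadrupling the horizon only doubles it, and the per-step value tends to zero.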
