Settling the Sample Complexity of Online Reinforcement Learning
Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon S. Du

TL;DR
This paper proves that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm, achieves minimax-optimal regret in finite-horizon online RL for every sample size, with no burn-in cost.
Contribution
It establishes the first regret bounds matching minimax lower bounds for all sample sizes in finite-horizon RL, removing the burn-in requirement.
Findings
Achieves regret on the order of $\min\{\sqrt{SAH^3K},\, HK\}$ (up to log factors), matching the minimax lower bound.
Provides a minimax-optimal PAC sample complexity of $SAH^3/\varepsilon^2$ episodes (up to log factors).
Develops new analysis techniques to handle statistical dependencies in online RL.
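To give a feel for the PAC result, here is a minimal sketch that evaluates the order $SAH^3/\varepsilon^2$ for concrete problem sizes. The function name `pac_episodes` is hypothetical, and the constant and logarithmic factors of the actual bound are dropped, so the numbers indicate scaling only, not a guarantee.

```python
import math


def pac_episodes(S: int, A: int, H: int, eps: float) -> int:
    """Order of episodes needed for eps-accuracy: S*A*H^3 / eps^2.

    Assumption: constants and log factors are omitted, so this is a
    scaling illustration rather than the exact bound from the paper.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil(S * A * H**3 / eps**2)


if __name__ == "__main__":
    # Halving eps multiplies the episode count by four.
    for eps in (1.0, 0.5, 0.25):
        print(eps, pac_episodes(S=10, A=5, H=20, eps=eps))
```

The cubic dependence on the horizon dominates quickly: doubling H multiplies the count by eight, while doubling S or A only doubles it.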
Abstract
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a "large-sample" regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions,…
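The two terms in the regret bound trade places at a specific sample size. Setting $\sqrt{SAH^3K} = HK$ and solving gives $K = SAH$: below that many episodes the trivial $HK$ term is the smaller one, and above it the $\sqrt{SAH^3K}$ term takes over. The sketch below illustrates this, with $H$ read as the horizon length and $K$ as the number of episodes. The function names are hypothetical, and constants and log factors are omitted.

```python
import math


def regret_order(S: int, A: int, H: int, K: int) -> float:
    """Order of min{sqrt(S*A*H^3*K), H*K}, without constants or log factors."""
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S: int, A: int, H: int) -> int:
    """Episode count K = S*A*H at which sqrt(S*A*H^3*K) equals H*K."""
    return S * A * H


if __name__ == "__main__":
    S, A, H = 10, 5, 20
    K_star = crossover_episodes(S, A, H)
    for K in (K_star // 10, K_star, K_star * 10):
        print(K, regret_order(S, A, H, K))
```

A bound that holds for all $K \geq 1$ therefore covers both regimes, which is what "no burn-in cost" refers to.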
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Mental Health Research Topics
