Data- and Variance-dependent Regret Bounds for Online Tabular MDPs
Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

TL;DR
This paper introduces algorithms for online tabular MDPs with known transitions that achieve data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime, and establishes nearly matching lower bounds.
Contribution
It develops data- and variance-dependent regret bounds for online tabular MDPs using global and policy optimization methods, with new complexity measures and nearly matching lower bounds.
Findings
Achieves first-order, second-order, and path-length regret bounds in adversarial MDPs.
Provides variance-aware regret bounds in stochastic MDPs, including gap-independent and gap-dependent bounds.
Establishes regret lower bounds that nearly match the upper bounds, confirming near-optimality.
Abstract
This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a…
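The abstract names optimistic follow-the-regularized-leader (FTRL) with log-barrier regularization as the common building block. The following is a minimal sketch of one such update, not the paper's algorithm: it assumes a plain probability simplex instead of the occupancy-measure or per-state policy sets of an MDP, fully observed loss vectors, and a fixed learning rate. The function name `log_barrier_oftrl_step`, the parameter `eta`, and the choice of the previous loss as the optimistic hint are all illustrative assumptions.

```python
import numpy as np

def log_barrier_oftrl_step(cum_loss, hint, eta, iters=100):
    """One optimistic FTRL step with a log-barrier regularizer on the simplex.

    Solves  argmin_x  <x, cum_loss + hint> - (1/eta) * sum_i log(x_i)
    over the probability simplex. The optimality conditions give
    x_i = 1 / (eta * (v_i + lam)) with v = cum_loss + hint, where the
    multiplier lam is chosen so that the x_i sum to one.
    """
    v = np.asarray(cum_loss, dtype=float) + np.asarray(hint, dtype=float)
    k = v.size
    # sum_i x_i(lam) is decreasing in lam; it is >= 1 at `lo` and <= 1 at `hi`.
    lo = -v.min() + 1.0 / eta
    hi = -v.min() + k / eta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (v + mid))) > 1.0:
            lo = mid
        else:
            hi = mid
    x = 1.0 / (eta * (v + 0.5 * (lo + hi)))
    return x / x.sum()  # absorb the remaining bisection error

# Illustrative loop (assumption: the hint for each round is the previous loss).
rng = np.random.default_rng(0)
cum_loss = np.zeros(4)
hint = np.zeros(4)
for _ in range(5):
    x = log_barrier_oftrl_step(cum_loss, hint, eta=0.5)
    loss = rng.uniform(size=4)  # synthetic losses in [0, 1]
    cum_loss += loss
    hint = loss
```

The hint vector is what makes the update optimistic: when it predicts the next loss well, regret analyses of this style of update scale with the prediction error rather than the raw loss magnitude, which is the usual route to second-order and path-length style bounds.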
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
