Tail Distribution of Regret in Optimistic Reinforcement Learning
Sajad Khodadadian, Mehrdad Moharrami

TL;DR
This paper derives detailed tail bounds for the regret in optimistic reinforcement learning, providing insights into the probability of large deviations and the distributional behavior of regret in finite-horizon MDPs.
Contribution
It introduces explicit tail bounds for regret in both model-based and model-free optimistic RL algorithms, extending analysis beyond average regret to distributional tail behavior.
Findings
Tail bounds exhibit a two-regime structure: sub-Gaussian then sub-Weibull tails.
Bounds depend on an instance-dependent scale and a transition threshold.
Algorithms' regret bounds are adjustable via a tuning parameter α.
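To make the two-regime shape concrete, here is a minimal sketch of a tail function that decays like a Gaussian up to a threshold and like a Weibull beyond it. The functional form, the function name `two_regime_tail`, and the parameters `scale`, `threshold`, and `theta` are illustrative assumptions, not the paper's actual bound; the summary above does not state the constants or how α enters, so α is not modeled here.

```python
import math

def two_regime_tail(x, scale=1.0, threshold=3.0, theta=0.5):
    """Hypothetical upper bound on Pr(regret >= x) with two regimes.

    All parameters are illustrative assumptions:
      scale     -- stands in for the instance-dependent scale
      threshold -- stands in for the transition threshold
      theta     -- Weibull shape for the heavier tail (theta < 2)
    """
    if x <= 0:
        return 1.0
    if x <= threshold:
        # Sub-Gaussian regime: exp(-(x / scale)^2), capped at 1.
        return min(1.0, math.exp(-((x / scale) ** 2)))
    # Sub-Weibull regime: exponent grows like x^theta, rescaled so the
    # two pieces agree at x == threshold.
    exponent_at_threshold = (threshold / scale) ** 2
    return math.exp(-exponent_at_threshold * (x / threshold) ** theta)

if __name__ == "__main__":
    for x in (1.0, 3.0, 6.0, 12.0):
        print(x, two_regime_tail(x))
```

Under these assumptions, the bound beyond the threshold decays more slowly than the Gaussian piece would if extended, which is the qualitative behavior a sub-Gaussian-then-sub-Weibull tail describes.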
Abstract
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret R_K over K episodes via explicit bounds on Pr(R_K ≥ x), going beyond analyses limited to E[R_K] or a single high-probability quantile. We analyze two natural exploration-bonus schedules for UCBVI: (i) a K-dependent scheme that explicitly incorporates the total number of episodes K, and (ii) a K-independent (anytime) scheme that depends only on the current episode index. We then complement the model-based results with an analysis of optimistic Q-learning (model-free) under a K-dependent bonus schedule. Across both the model-based and model-free settings, we obtain upper…
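The difference between the two bonus schedules can be sketched as follows. This is an assumption-laden illustration: the Hoeffding-style form `c * H * sqrt(log(.) / n)`, the function names, and the constant `c` are hypothetical and are not taken from the paper. The only point is which quantity appears inside the logarithm: the total number of episodes (called `total_episodes` here) or the current episode index.

```python
import math

def bonus_k_dependent(n_visits, total_episodes, horizon, c=1.0):
    """Hypothetical bonus that uses the total number of episodes.

    The log term is fixed for the whole run, so the total number of
    episodes must be known in advance.
    """
    n = max(n_visits, 1)  # avoid division by zero for unvisited pairs
    return c * horizon * math.sqrt(math.log(total_episodes) / n)

def bonus_anytime(n_visits, episode_index, horizon, c=1.0):
    """Hypothetical anytime bonus that uses only the current episode.

    episode_index is 1-based; the +1 keeps the log positive in the
    first episode.
    """
    n = max(n_visits, 1)
    return c * horizon * math.sqrt(math.log(episode_index + 1) / n)

if __name__ == "__main__":
    # Same state-action pair with 4 visits, horizon 5, at episode 2 of 100.
    print(bonus_k_dependent(4, total_episodes=100, horizon=5))
    print(bonus_anytime(4, episode_index=2, horizon=5))
```

In this sketch the anytime bonus is smaller early in the run and grows with the episode index, while the other schedule uses the same log factor throughout.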
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
