Concentration of Cumulative Reward in Markov Decision Processes
Borna Sayedana, Peter E. Caines, Aditya Mahajan

TL;DR
This paper studies how the cumulative reward in Markov Decision Processes concentrates around its expected value. It provides unified asymptotic and non-asymptotic bounds for the average-reward, discounted-reward, and finite-horizon settings, with implications for policy comparison and learning.
Contribution
It introduces a unified framework for reward concentration in MDPs, covering both asymptotic and non-asymptotic regimes, and explores implications for policy evaluation and regret definitions.
Findings
Established a law of large numbers and a central limit theorem (CLT) for cumulative reward in MDPs
Derived Azuma-Hoeffding-type inequalities for finite-horizon rewards
Showed rate-equivalence of different regret definitions
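As a rough illustration of the law-of-large-numbers finding, the sketch below simulates a small Markov reward process, that is, an MDP under a fixed stationary policy, and tracks how far the time-averaged reward falls from its long-run value. The two-state chain `P`, the rewards `r`, and the helper `simulate_average_reward` are hypothetical choices made for this example and are not taken from the paper.

```python
import random


def simulate_average_reward(P, r, T, seed=0, s0=0):
    """Return (1/T) * sum of rewards along one sample path of length T.

    P is a row-stochastic transition matrix (list of lists), r[s] is the
    reward collected in state s. Both are illustrative assumptions.
    """
    rng = random.Random(seed)
    states = range(len(P))
    s, total = s0, 0.0
    for _ in range(T):
        total += r[s]
        # Sample the next state from row s of the transition matrix.
        s = rng.choices(states, weights=P[s])[0]
    return total / T


# Hypothetical two-state chain induced by some fixed stationary policy.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]

# The stationary distribution of P is (2/3, 1/3), so the long-run
# average reward is 2/3 * 1.0 + 1/3 * 0.0 = 2/3.
J = 2.0 / 3.0

for T in (100, 1000, 10000):
    # Largest deviation from J over 20 independent sample paths.
    gaps = [abs(simulate_average_reward(P, r, T, seed=k) - J) for k in range(20)]
    print(T, max(gaps))
```

The printed deviation would be expected to shrink as the horizon `T` grows, which is the qualitative behavior that concentration bounds make precise. The sketch says nothing about the specific constants or rates derived in the paper.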
Abstract
In this paper, we investigate the concentration properties of cumulative reward in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both infinite-horizon settings (i.e., average and discounted reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of iterated logarithms, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of iterated logarithms. Additionally, we explore two key implications of our results. First, we analyze the sample path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the…
Taxonomy
Topics: Advanced Research in Systems and Signal Processing
