Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

TL;DR
This paper extends Thompson sampling to infinite-horizon discounted MDPs with Borel state and action spaces, introducing new regret metrics and proving that the expected residual regret converges to zero exponentially fast.
Contribution
It develops a framework for analyzing sampling-based learning in MDPs with Borel state and action spaces, including a new regret decomposition and convergence results for Thompson sampling.
Findings
Residual regret for Thompson sampling converges to zero exponentially fast.
New regret metrics decompose the standard expected regret into finite-time, state, and residual components.
Thompson sampling achieves complete learning under mild conditions.
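To make the learning loop concrete, here is a minimal sketch of Thompson sampling on a toy discounted MDP. It is not the authors' algorithm or setting: the paper works with Borel state and action spaces, whereas this example assumes two states, two actions, and an unknown parameter taking one of two hypothetical values ("A" or "B"). The names `P`, `reward`, `greedy_policy`, and `thompson_sampling`, the discount factor 0.9, and the choice to resample the parameter every period are all illustrative assumptions.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)
STATES, ACTIONS = (0, 1), (0, 1)

# Hypothetical candidate models: P[theta][s][a] = probability of moving to state 1.
P = {
    "A": [[0.1, 0.9], [0.1, 0.9]],
    "B": [[0.9, 0.1], [0.9, 0.1]],
}

def reward(s, a):
    # Hypothetical reward: being in state 1 pays 1, regardless of the action.
    return float(s == 1)

def greedy_policy(theta, iters=200):
    """Value iteration for the discounted MDP under the model theta."""
    v = [0.0, 0.0]

    def q(s, a):
        p1 = P[theta][s][a]
        return reward(s, a) + GAMMA * ((1 - p1) * v[0] + p1 * v[1])

    for _ in range(iters):
        v = [max(q(s, a) for a in ACTIONS) for s in STATES]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in STATES]

def thompson_sampling(true_theta, steps=50, seed=0):
    rng = random.Random(seed)
    posterior = {theta: 1.0 / len(P) for theta in P}  # uniform prior
    s = 0
    for _ in range(steps):
        # Sample a model from the posterior and act greedily for it.
        theta = rng.choices(list(posterior), weights=list(posterior.values()))[0]
        a = greedy_policy(theta)[s]
        # The environment transitions according to the true model.
        s_next = int(rng.random() < P[true_theta][s][a])
        # Bayes update using the likelihood of the observed transition.
        for th in posterior:
            p1 = P[th][s][a]
            posterior[th] *= p1 if s_next == 1 else 1 - p1
        total = sum(posterior.values())
        posterior = {th: w / total for th, w in posterior.items()}
        s = s_next
    return posterior
```

Calling `thompson_sampling("A")` returns the posterior over the two candidate models after 50 transitions. In this toy case every transition is informative about the parameter, so the posterior concentrates quickly on the true model; the paper's results concern far more general settings and should not be read off this example.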
Abstract
This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDP with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling, we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future…
