Asymptotically Optimal Policies for Weakly Coupled Markov Decision Processes
Diego Goldsztajn, Konstantin Avrachenkov

TL;DR
This paper develops asymptotically optimal policies for weakly coupled Markov decision processes, a setting that generalizes restless bandits, by connecting the problem to a deterministic control problem and providing explicit policy constructions.
Contribution
It introduces a novel approach linking large-scale weakly coupled MDPs to a deterministic control problem, enabling explicit asymptotically optimal policies under broad conditions.
Findings
The constructed policies achieve the maximum expected average reward as the number of processes grows large.
Under certain assumptions on the constraints, the sufficient conditions for the existence of solutions hold automatically given common unichain and aperiodicity assumptions.
Numerical experiments confirm the theoretical results in multichain setups.
Abstract
We consider the problem of maximizing the expected average reward obtained over an infinite time horizon by weakly coupled Markov decision processes. Our setup is a substantial generalization of the multi-armed restless bandit problem that allows for multiple actions and constraints. We establish a connection with a deterministic and continuous-variable control problem where the objective is to maximize the average reward derived from an occupancy measure that represents the empirical distribution of the processes when the number of processes tends to infinity. We show that a solution of this fluid problem can be used to construct policies for the weakly coupled processes that achieve the maximum expected average reward as the number of processes tends to infinity, and we give sufficient conditions for the existence of solutions. Under certain assumptions on the constraints, we prove that these conditions are automatically satisfied if…
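To make the notion of an occupancy measure concrete, here is a minimal Python sketch that computes the empirical distribution of a finite population of processes over (state, action) pairs and the per-process average reward it induces. Everything in it is illustrative and not taken from the paper: the function names, the two-state toy rewards, and the choice of a constraint that counts the fraction of processes taking action 1 are all assumptions.

```python
from collections import Counter

def occupancy_measure(states, actions):
    """Fraction of processes in each (state, action) pair."""
    n = len(states)
    counts = Counter(zip(states, actions))
    return {pair: count / n for pair, count in counts.items()}

def average_reward(x, reward):
    """Per-process reward: sum over (s, a) of x(s, a) * r(s, a)."""
    return sum(mass * reward[pair] for pair, mass in x.items())

def resource_usage(x, cost):
    """Per-process resource consumption under the occupancy measure x."""
    return sum(mass * cost[pair] for pair, mass in x.items())

# Hypothetical toy instance: 2 states, 2 actions (0 = passive, 1 = active).
reward = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}
# Assumed coupling constraint: each active process uses one unit of a budget.
cost = {(s, a): float(a) for s in (0, 1) for a in (0, 1)}

# Snapshot of n = 4 processes: current state and chosen action of each.
states = [0, 0, 1, 1]
actions = [0, 1, 0, 1]

x = occupancy_measure(states, actions)
print(x)
print(average_reward(x, reward))
print(resource_usage(x, cost))
```

In this sketch the coupling between processes enters only through `resource_usage`, which would be compared against a budget; the fluid problem described in the abstract optimizes over such measures directly rather than over individual processes.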
