Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study
Xiaohan Wei, Hao Yu, Michael J. Neely

TL;DR
This paper studies online learning in multiple parallel MDPs coupled by global constraints and proposes a distributed algorithm that achieves $O(\sqrt{T})$ regret and constraint violation over $T$ time slots.
Contribution
It introduces a novel distributed online algorithm for weakly coupled MDPs with theoretical guarantees on regret and constraint violations.
Findings
Achieves $O(\sqrt{T})$ regret and constraint violation bounds.
Develops new analysis techniques combining ergodicity, mixing times, and perturbation analysis.
Provides a framework for online decision-making in complex coupled MDP systems.
Abstract
We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time-varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in $T$ slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have a tight $O(\sqrt{T})$ regret and constraint violations simultaneously. To obtain such a bound, we combine several new ingredients including ergodicity and mixing time bound in weakly coupled MDPs, a new regret analysis for online…
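The decision pattern described in the abstract (each subsystem acting on its own after seeing a shared multiplier built from past observations) can be illustrated with a toy sketch. This is not the paper's algorithm: it drops the MDP state dynamics entirely, and the virtual-queue style update, the trade-off weight `v`, the two-action setup, and the random cost and constraint values are all assumptions made for illustration.

```python
import random


def update_multiplier(q, constraint_values):
    """Assumed virtual-queue style update: add observed constraint values, floor at zero."""
    return max(q + sum(constraint_values), 0.0)


def choose_action(actions, prev_cost, prev_constraint, q, v):
    """One subsystem's local decision, using only the shared multiplier q
    and the cost/constraint functions it observed in the previous slot."""
    return min(actions, key=lambda a: v * prev_cost[a] + q * prev_constraint[a])


def run(num_slots=200, num_systems=3, v=10.0, seed=0):
    rng = random.Random(seed)
    actions = [0, 1]
    q = 0.0
    # Nothing has been observed before the first slot, so start from zeros.
    prev_cost = [{a: 0.0 for a in actions} for _ in range(num_systems)]
    prev_con = [{a: 0.0 for a in actions} for _ in range(num_systems)]
    total_violation = 0.0
    history = []
    for _ in range(num_slots):
        # Decisions are made first, each subsystem on its own.
        chosen = [
            choose_action(actions, prev_cost[k], prev_con[k], q, v)
            for k in range(num_systems)
        ]
        # Only afterwards are this slot's functions revealed (hypothetical toy values:
        # action 1 is cheaper but consumes the shared constraint budget).
        cost = [{0: rng.uniform(0.5, 1.0), 1: rng.uniform(0.0, 0.5)}
                for _ in range(num_systems)]
        con = [{0: -0.2, 1: rng.uniform(0.0, 0.6)} for _ in range(num_systems)]
        observed = [con[k][chosen[k]] for k in range(num_systems)]
        total_violation += sum(observed)
        q = update_multiplier(q, observed)
        history.append(q)
        prev_cost, prev_con = cost, con
    return q, total_violation, history
```

Because each update is floored at zero, the final multiplier returned by `run` is never smaller than the accumulated constraint value, so in this toy setting bounding the multiplier also bounds the cumulative violation.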
Taxonomy
Topics: Advanced Bandit Algorithms Research · Age of Information Optimization · Advanced Wireless Network Optimization
