Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study

Xiaohan Wei; Hao Yu; Michael J. Neely

arXiv:1709.03465 · math.OC · September 12, 2017 · Proc. ACM Meas. Anal. Comput. Syst. · 1 citation

TL;DR

This paper studies online learning in multiple parallel MDPs coupled by global constraints, proposing a distributed algorithm that achieves $O(\sqrt{T})$ regret and constraint violation over $T$ time slots.

Contribution

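It introduces a distributed online algorithm for weakly coupled MDPs, in which each MDP makes its own decision each slot after observing a multiplier computed from past information, with theoretical guarantees on regret and constraint violation.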

Findings

1. Achieves $O(\sqrt{T})$ regret and constraint violation bounds.
2. Develops new analysis techniques combining ergodicity, mixing times, and perturbation analysis.
3. Provides a framework for online decision-making in complex coupled MDP systems.
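For readers unfamiliar with the two quantities in the first finding, the block below gives a generic form of regret and constraint violation for $K$ coupled MDPs compared against a fixed randomized stationary policy. The notation ($f_t^{(k)}$, $g_{i,t}^{(k)}$, $\theta_\pi^{(k)}$, $\Pi_{\mathrm{feas}}$, $m$ constraints) is assumed for illustration and may differ from the paper's exact definitions.

```latex
\begin{align*}
\mathrm{Regret}(T) &= \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{k=1}^{K} f_t^{(k)}\big(s_t^{(k)}, a_t^{(k)}\big)\Big]
  - \min_{\pi \in \Pi_{\mathrm{feas}}} \sum_{t=1}^{T}\sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_{\pi}^{(k)} \big\rangle \\
\mathrm{Violation}_i(T) &= \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{k=1}^{K} g_{i,t}^{(k)}\big(s_t^{(k)}, a_t^{(k)}\big)\Big], \qquad i = 1, \dots, m
\end{align*}
```

Here $\theta_\pi^{(k)}$ is the stationary state-action distribution that policy $\pi$ induces on MDP $k$, $\langle\cdot,\cdot\rangle$ sums a function against that distribution, and $\Pi_{\mathrm{feas}}$ is the set of randomized stationary policies satisfying the global constraints. An $O(\sqrt{T})$ bound means both quantities grow at most proportionally to $\sqrt{T}$, so their per-slot averages vanish at rate $O(1/\sqrt{T})$.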

Abstract

We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time-varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in $T$ slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have a tight $O(\sqrt{T})$ regret and constraint violation simultaneously. To obtain such a bound, we combine several new ingredients including ergodicity and mixing time bound in weakly coupled MDPs, a new regret analysis for online…
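To make the phrase "each MDP makes its own decision each slot after observing a multiplier computed from past information" concrete, here is a simplified Python sketch. It is not the authors' algorithm: it assumes a single global constraint, randomly generated (rather than adversarial) per-slot functions, hypothetical transition kernels, and a greedy one-step action rule in place of the paper's per-MDP update. The names `run_sketch`, `V`, and `Q` are hypothetical. The shared multiplier `Q` follows a virtual-queue update, a common device in drift-plus-penalty style methods, and grows whenever the coupled constraint is exceeded.

```python
import random


def normalize(weights):
    """Scale a list of positive weights into a probability vector."""
    total = sum(weights)
    return [w / total for w in weights]


def random_table(rng, num_mdps, num_states, num_actions, low, high):
    """Hypothetical per-slot function table: table[k][s][a] for MDP k."""
    return [[[rng.uniform(low, high) for _ in range(num_actions)]
             for _ in range(num_states)]
            for _ in range(num_mdps)]


def run_sketch(num_mdps=3, num_states=2, num_actions=2, horizon=200, V=10.0, seed=0):
    """Simplified illustration (NOT the paper's algorithm): K MDPs coupled by one
    global constraint sum_k g_t^(k)(s, a) <= 0, steered by a shared multiplier Q."""
    rng = random.Random(seed)
    # Hypothetical transition kernels: P[k][s][a] is a distribution over next states.
    P = [[[normalize([rng.random() + 0.1 for _ in range(num_states)])
           for _ in range(num_actions)]
          for _ in range(num_states)]
         for _ in range(num_mdps)]
    states = [0] * num_mdps
    Q = 0.0  # shared multiplier, computed only from past constraint values
    # Nothing has been observed before slot 0, so start from all-zero tables.
    zeros = [[[0.0] * num_actions for _ in range(num_states)] for _ in range(num_mdps)]
    f_prev, g_prev = zeros, zeros
    total_cost = 0.0
    total_violation = 0.0
    for _ in range(horizon):
        # Each MDP decides locally from its own state, last slot's functions, and Q.
        actions = []
        for k in range(num_mdps):
            s = states[k]
            scores = [V * f_prev[k][s][a] + Q * g_prev[k][s][a]
                      for a in range(num_actions)]
            actions.append(min(range(num_actions), key=lambda a: scores[a]))
        # This slot's objective and constraint functions are revealed only now.
        f_t = random_table(rng, num_mdps, num_states, num_actions, 0.0, 1.0)
        g_t = random_table(rng, num_mdps, num_states, num_actions, -1.0, 1.0)
        cost = sum(f_t[k][states[k]][actions[k]] for k in range(num_mdps))
        g_val = sum(g_t[k][states[k]][actions[k]] for k in range(num_mdps))
        total_cost += cost
        total_violation += g_val
        # Virtual-queue update: Q grows when the coupled constraint is exceeded.
        Q = max(Q + g_val, 0.0)
        # Each MDP moves to its next state according to its own kernel.
        states = [rng.choices(range(num_states), weights=P[k][states[k]][actions[k]])[0]
                  for k in range(num_mdps)]
        f_prev, g_prev = f_t, g_t
    return total_cost, total_violation, Q


if __name__ == "__main__":
    cost, violation, Q = run_sketch(horizon=1000)
    print(cost, violation, Q)
```

Because `Q` starts at zero and each update satisfies `Q_new >= Q + g_val`, the final multiplier upper-bounds the accumulated constraint value. This is why bounding the multiplier is one standard route to bounding constraint violation.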

Taxonomy

Topics: Advanced Bandit Algorithms Research · Age of Information Optimization · Advanced Wireless Network Optimization