Sleeping Experts and Bandits Approach to Constrained Markov Decision Processes
Hyeong Soo Chang

TL;DR
This paper introduces simulation-based algorithms inspired by sleeping experts and bandits strategies to find approximately optimal policies in large constrained Markov decision processes, with convergence guarantees and computational efficiency.
Contribution
It adapts sleeping experts and bandits algorithms to constrained MDPs, providing convergence analysis and showing that the computational complexity is independent of the state and action space sizes when the given policy set is relatively small.
Findings
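The expected performance of each algorithm converges to the value of an optimal policy in the given policy set.
Convergence rates are established for the expected performance.
The algorithm adapted from sleeping experts also converges almost surely to an optimal policy, at an exponential rate.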
Abstract
This brief paper presents simple simulation-based algorithms for obtaining an approximately optimal policy in a given finite set in large finite constrained Markov decision processes. The algorithms are adapted from playing strategies for the "sleeping experts and bandits" problem, and their computational complexities are independent of state and action space sizes if the given policy set is relatively small. We establish convergence of their expected performances to the value of an optimal policy, together with convergence rates, and also almost-sure convergence to an optimal policy with an exponential rate for the algorithm adapted within the context of sleeping experts.
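To make the idea in the abstract concrete, here is a minimal Python sketch of one way a sleeping-experts style selection over a finite policy set might look. It is an illustration under stated assumptions, not the paper's algorithm: each candidate policy is treated as an expert, every policy is simulated once per round, a policy is considered "asleep" when its empirical average cost exceeds the constraint bound, and the awake policy with the highest empirical average reward is selected. The names `select_policy`, `toy_simulate`, and `cost_bound`, and the toy policies themselves, are hypothetical.

```python
import random


def select_policy(policies, simulate, cost_bound, n_rounds, seed=0):
    """Return the index of the empirically best feasible policy, or None.

    Assumed interface (hypothetical): simulate(policy, rng) returns one
    sampled (reward, cost) pair from a simulation run of that policy.
    """
    rng = random.Random(seed)
    k = len(policies)
    reward_sum = [0.0] * k
    cost_sum = [0.0] * k
    leader = None

    for t in range(1, n_rounds + 1):
        # One simulation per policy per round: the work per round scales
        # with the number of policies, not with the state or action space.
        for i, policy in enumerate(policies):
            reward, cost = simulate(policy, rng)
            reward_sum[i] += reward
            cost_sum[i] += cost

        # A policy is "awake" if its empirical average cost meets the bound.
        awake = [i for i in range(k) if cost_sum[i] / t <= cost_bound]

        # Follow the awake leader: best empirical average reward so far.
        leader = max(awake, key=lambda i: reward_sum[i] / t) if awake else None

    return leader


def toy_simulate(policy, rng):
    """Toy stand-in for an MDP simulator: mean values plus bounded noise."""
    mean_reward, mean_cost = policy
    noise = lambda: rng.uniform(-0.1, 0.1)
    return mean_reward + noise(), mean_cost + noise()


# Hypothetical policies given as (mean reward, mean cost) pairs.
toy_policies = [(1.0, 0.9), (0.6, 0.3), (0.4, 0.1)]
best = select_policy(toy_policies, toy_simulate, cost_bound=0.5, n_rounds=2000)
print(best)
```

In this toy setup the first policy has the highest reward but violates the cost bound, so the sketch settles on the second policy. A bandit-style variant would simulate only the policy chosen in each round rather than all of them; that variant is not shown here.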
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
