Model-Based Exploration in Monitored Markov Decision Processes
Alireza Kazemipour, Simone Parisi, Matthew E. Taylor, Michael Bowling

TL;DR
This paper introduces a model-based algorithm for Monitored Markov Decision Processes that improves upon prior methods by leveraging problem structure, providing finite-sample guarantees, and demonstrating faster convergence in benchmarks.
Contribution
The paper presents a novel model-based algorithm for Mon-MDPs that addresses limitations of prior methods: it can leverage a known monitor and comes with finite-sample performance bounds.
Findings
Faster convergence than prior algorithms in benchmarks
Significant improvement when the monitor is known
First finite-sample bound on performance in Mon-MDPs
Abstract
A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for 'unsolvable' Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
