Model-Based Exploration in Monitored Markov Decision Processes

Alireza Kazemipour; Simone Parisi; Matthew E. Taylor; Michael Bowling

arXiv:2502.16772 · cs.LG · March 24, 2026

TL;DR

This paper introduces a model-based algorithm for Monitored Markov Decision Processes that improves upon prior methods by leveraging problem structure, providing finite-sample guarantees, and demonstrating faster convergence in benchmarks.

Contribution

The paper presents a novel model-based algorithm for Mon-MDPs that addresses limitations of prior methods: it can leverage a known monitor, and it comes with finite-sample performance bounds.

Findings

01 Faster convergence than prior algorithms in benchmarks

02 Significant improvement when the monitor is known

03 First finite-sample bound on performance in Mon-MDPs

Abstract

A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for 'unsolvable' Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that…
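The abstract is cut off before it describes how the two interval-estimation instances are used, so the following does not reproduce the paper's algorithm. It is only a minimal sketch of the generic idea behind model-based interval estimation with an exploration bonus (in the style of MBIE-EB): Q-values are computed from an empirical model, and each state-action pair receives a bonus of `beta / sqrt(n)` that shrinks as its visit count `n` grows. The function name, the argument layout, and the default values of `beta`, `gamma`, and `r_max` are all assumptions made for illustration.

```python
import math


def optimistic_q_values(counts, reward_sums, next_counts,
                        beta=1.0, gamma=0.9, r_max=1.0, iters=200):
    """Value iteration on an empirical model with a count-based bonus.

    Hypothetical inputs (not taken from the paper):
      counts[s][a]          number of visits to (s, a)
      reward_sums[s][a]     sum of rewards observed at (s, a)
      next_counts[s][a][t]  number of observed transitions (s, a) -> t
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    q_max = r_max / (1.0 - gamma)  # upper bound on any discounted return
    q = [[q_max] * n_actions for _ in range(n_states)]

    for _ in range(iters):
        new_q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                n = counts[s][a]
                if n == 0:
                    # Unvisited pairs stay maximally optimistic.
                    new_q[s][a] = q_max
                    continue
                r_hat = reward_sums[s][a] / n   # empirical mean reward
                bonus = beta / math.sqrt(n)     # shrinks with more visits
                expected_next = sum(
                    (next_counts[s][a][t] / n) * max(q[t])
                    for t in range(n_states)
                )
                new_q[s][a] = min(q_max, r_hat + bonus + gamma * expected_next)
        q = new_q
    return q


# Usage: one state, two actions; action 0 is well explored, action 1 never tried.
q = optimistic_q_values(
    counts=[[100, 0]],
    reward_sums=[[50.0, 0.0]],
    next_counts=[[[100], [0]]],
)
greedy_action = max(range(2), key=lambda a: q[0][a])
```

In this toy case the greedy action with respect to the optimistic Q-values is the untried one, which is the behaviour such bonuses are meant to produce. How a comparable mechanism would handle rewards that the monitor does not reveal is exactly the part the truncated abstract leaves out, so it is not sketched here.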

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization