Generalization in Monitored Markov Decision Processes (Mon-MDPs)

Montaser Mohammedalamen; Michael Bowling

arXiv:2505.08988 · cs.AI · May 15, 2025

TL;DR

This paper investigates how reinforcement learning agents can generalize in monitored Markov decision processes with unobservable rewards using function approximation and reward models, addressing challenges like overgeneralization.

Contribution

It introduces a method combining function approximation with reward models for Mon-MDPs, and proposes cautious policy optimization to reduce overgeneralization.

Findings

01. Reward models enable near-optimal policies in Mon-MDPs.

02. Overgeneralization causes incorrect reward extrapolation.

03. Cautious optimization mitigates overgeneralization effects.
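
To make the overgeneralization finding concrete, here is a minimal sketch of one generic way to be cautious about extrapolated rewards: a bootstrap ensemble whose disagreement is subtracted from the mean prediction (a lower confidence bound). This is an assumed stand-in, not the paper's cautious policy optimization; the names `fit_ensemble`, `cautious_reward`, and the penalty weight `k` are hypothetical.

```python
import numpy as np

def fit_ensemble(X, y, n_models=20, l2=1e-3, seed=0):
    """Fit ridge-regression reward models on bootstrap resamples (illustrative only)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
        Xb, yb = X[idx], y[idx]
        A = Xb.T @ Xb + l2 * np.eye(X.shape[1])
        models.append(np.linalg.solve(A, Xb.T @ yb))
    return np.stack(models)  # shape: (n_models, n_features)

def cautious_reward(models, X, k=1.0):
    """Return (mean prediction, mean minus k times ensemble standard deviation)."""
    preds = X @ models.T  # shape: (n_points, n_models)
    mean = preds.mean(axis=1)
    return mean, mean - k * preds.std(axis=1)

# Assumed toy data: rewards are observed only for feature values in [0, 1].
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=50)
X_train = np.column_stack([np.ones_like(x_train), x_train])  # bias + feature
y_train = x_train + rng.normal(scale=0.1, size=50)

models = fit_ensemble(X_train, y_train)

# Query one point inside the observed range and one far outside it.
X_query = np.array([[1.0, 0.5], [1.0, 5.0]])
mean_pred, cautious = cautious_reward(models, X_query, k=2.0)
penalty = mean_pred - cautious  # grows where the ensemble disagrees
print("penalty inside range:", penalty[0], "far outside:", penalty[1])
```

In this toy setup the penalty is larger for the query far from the observed data, so a policy optimized against the cautious estimate would lean less on extrapolated rewards.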

Abstract

Reinforcement learning (RL) typically models the interaction between the agent and environment as a Markov decision process (MDP), where the rewards that guide the agent's behavior are always observable. However, in many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). Prior work on Mon-MDPs has been limited to simple, tabular cases, restricting its applicability to real-world problems. This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards. Therefore, we demonstrate that such generalization with a reward model achieves near-optimal policies in…
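
The generalization the abstract describes, from monitored states with observable rewards to unmonitored states, can be illustrated with a small sketch. It assumes a linear reward model fit by ridge regression on synthetic features; the paper's actual architecture and environments are not specified here, and the names `fit_reward_model`, `predict_reward`, and the `monitored` mask are hypothetical.

```python
import numpy as np

def fit_reward_model(features, rewards, monitored, l2=1e-3):
    """Ridge regression using only transitions whose reward was observed."""
    X = features[monitored]
    y = rewards[monitored]
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def predict_reward(weights, features):
    """Predict rewards for any states, including unmonitored ones."""
    return features @ weights

# Assumed synthetic setup: the true reward is linear in the state features.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
features = rng.normal(size=(200, 3))
rewards = features @ true_w

# The monitor reveals rewards only in part of the state space.
monitored = features[:, 0] > 0.0
observed = np.where(monitored, rewards, np.nan)  # NaN marks an unobserved reward

weights = fit_reward_model(features, observed, monitored)
pred_unmonitored = predict_reward(weights, features[~monitored])
max_error = np.max(np.abs(pred_unmonitored - rewards[~monitored]))
print("max error on unmonitored states:", max_error)
```

The sketch works because the assumed reward is exactly linear in the features; when the model class does not match the true reward, the same extrapolation is where overgeneralization would show up.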

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper effectively demonstrates the usefulness of function approximation in the context of Mon-MDPs.
- It shows that incorporating robust policy optimization can help handle the over-generalization problem caused by the epistemic uncertainty of the environments.

Weaknesses

- My main concern lies in the novelty and significance of the proposed approach. The algorithm appears to be a combination of several existing techniques, including Mon-MDP, function approximation, and the "learning to be curious" framework to handle out-of-distribution data. While the paper provides extensive experimental results in the plant-watering environment, the findings are not particularly surprising. The advantages of Mon-MDP over baselines have already been shown in the tabular settin

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The Mon-MDP framework is relatively recent, and has not been investigated in adequate depth. This paper makes a first effort to shed light on some critical challenges and limitations of Mon-MDPs, while also discussing ways to mitigate them (e.g., by learning a reward model).
- The experimental study contains targeted experiments that are able to confirm the authors' findings.

Weaknesses

- In my view, the novelty is limited. The authors essentially explore the use of function approximation and a learned reward model for Mon-MDPs. This is a straightforward idea without substantial innovation. Function approximation has been known for a long time to improve the RL performance and is nowadays an almost indispensable component of modern reinforcement learning systems. Adding it to Mon-MDPs makes a lot of sense, but is otherwise straightforward. Likewise, the learned reward model mak

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1) The Mon-MDP setting is clearly defined, although I do not clearly follow its impact and utility, especially in the authors' example and experiments (more on this below).
2) The set of experiments, while preliminary, seems to be a good starting point in exploring this line of research.
3) The paper is well written in the sense that, at many places, it explains the scope of its results and settings and does not seem to overclaim results.

Weaknesses

1) Disconnect between the model and the experiments. While the Mon-MDP formalism is general, the experiments instantiate only degenerate monitors, either deterministic spatial gating of observability or a binary ask/no-ask with fixed cost. These settings could be modeled without a stateful monitor MDP, and there are some papers that do just that. To justify the general model, I recommend adding scenarios with nontrivial monitor dynamics (at least giving examples where stateful monitor MDP makes

Taxonomy

Topics: AI-based Problem Solving and Planning · Bayesian Modeling and Causal Inference · Business Process Modeling and Analysis