Off-Policy Evaluation in Partially Observable Environments
Guy Tennenholtz, Shie Mannor, Uri Shalit

TL;DR
This paper studies batch off-policy evaluation in partially observable environments. It introduces the Decoupled POMDP model and shows how evaluation under this model mitigates the bias and estimation errors that arise in general POMDPs.
Contribution
It defines the off-policy evaluation problem for POMDPs, introduces the Decoupled POMDP model, and provides new evaluation techniques to mitigate bias.
Findings
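Importance Sampling is prone to bias in POMDPs, with a risk of arbitrarily large errors.
Off-policy evaluation under the Decoupled POMDP model mitigates the estimation errors inherent to general POMDPs.
Experiments on synthetic medical data compare Importance Sampling with the proposed evaluation result.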
Abstract
This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
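To make the bias concrete, the following is a minimal sketch of a single-step toy problem, not the paper's experiment or its estimator. All names and numbers here (`simulate`, `is_estimates`, the propensities 0.1 and 0.9, the reward rule) are assumptions chosen for illustration. A hidden binary state `u` drives both the behavior policy and the reward, while the evaluator sees only actions and rewards. Importance Sampling with propensities estimated from observed data is compared against an oracle variant that is allowed to see `u`.

```python
import random

def simulate(n=200_000, seed=0):
    """Log (u, a, r) tuples from a behavior policy that depends on hidden state u."""
    rng = random.Random(seed)
    # Hypothetical behavior policy: P(a=1 | u). The evaluator never sees u.
    p_act1_given_u = {0: 0.1, 1: 0.9}
    data = []
    for _ in range(n):
        u = 1 if rng.random() < 0.5 else 0            # hidden state, P(u=1) = 0.5
        a = 1 if rng.random() < p_act1_given_u[u] else 0
        r = 1.0 if a == u else 0.0                    # reward depends on hidden state
        data.append((u, a, r))
    return data, p_act1_given_u

def is_estimates(data, p_act1_given_u):
    """Estimate the value of the evaluation policy 'always take a=1'."""
    n = len(data)
    # Propensity as seen by the evaluator: the marginal P(a=1), ignoring u.
    p_a1_obs = sum(a for _, a, _ in data) / n
    # Naive IS: weight = 1[a=1] / P(a=1), using observed quantities only.
    naive = sum(r / p_a1_obs for _, a, r in data if a == 1) / n
    # Oracle IS: weight = 1[a=1] / P(a=1 | u), which requires the hidden state.
    oracle = sum(r / p_act1_given_u[u] for u, a, r in data if a == 1) / n
    return naive, oracle

data, propensities = simulate()
naive_is, oracle_is = is_estimates(data, propensities)
true_value = 0.5  # 'always a=1' earns reward only when u=1, and P(u=1) = 0.5

print(f"true value: {true_value:.3f}")
print(f"naive IS  : {naive_is:.3f}")
print(f"oracle IS : {oracle_is:.3f}")
```

In this toy setup the naive estimate converges to roughly 0.9 while the true value is 0.5, because the logged action carries information about the hidden state that the observed propensities ignore. The oracle estimate recovers 0.5, but only by using a variable that is unavailable under partial observability.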
