Off-Policy Evaluation in Partially Observable Environments

Guy Tennenholtz; Shie Mannor; Uri Shalit

arXiv:1909.03739·cs.LG·November 26, 2019

TL;DR

This paper studies batch off-policy evaluation in partially observable environments, where estimates are inherently prone to bias. It introduces the Decoupled POMDP model and shows how evaluation under that model mitigates the estimation errors inherent to general POMDPs.

Contribution

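The paper defines the off-policy evaluation problem for POMDPs, introduces the Decoupled POMDP model, in which observed and unobserved variables are decoupled into two dynamic processes, and shows how off-policy evaluation can be performed under this model to mitigate estimation errors.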

Findings

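01. Importance Sampling can produce biased estimates in POMDPs, with a risk of arbitrarily large errors.

02. Off-policy evaluation under the Decoupled POMDP model mitigates the estimation errors inherent to general POMDPs.

03. Experiments on synthetic medical data compare the proposed evaluation result against Importance Sampling.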

Abstract

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
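
As background for the Importance Sampling baseline mentioned in the abstract, the following is a minimal sketch of the textbook trajectory-wise estimator, run on a one-step toy problem with a hidden state. The toy environment, the function names `is_estimate` and `sample_episode`, and all numbers are assumptions made for illustration; they are not the paper's method or its synthetic medical data. The behavior policy sees a hidden state `u` that is never logged, so the propensities that can be computed from the observed data alone do not correct for it.

```python
import random

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Trajectory-wise Importance Sampling estimate of the value of pi_e.

    trajectories: list of episodes, each a list of (observation, action, reward).
    pi_e, pi_b: functions (action, observation) -> probability.
    """
    total = 0.0
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for t, (obs, action, reward) in enumerate(episode):
            weight *= pi_e(action, obs) / pi_b(action, obs)  # cumulative ratio
            ret += (gamma ** t) * reward                     # discounted return
        total += weight * ret
    return total / len(trajectories)

def sample_episode(rng):
    """Hypothetical one-step environment with an unobserved confounder."""
    u = rng.randint(0, 1)                      # hidden state, never logged
    a = u if rng.random() < 0.9 else 1 - u     # behavior policy depends on u
    r = 1.0 if a == u else 0.0                 # reward depends on u as well
    return [(0, a, r)]                         # the observation is constant

rng = random.Random(0)
data = [sample_episode(rng) for _ in range(20000)]

def pi_e(action, obs):
    return 0.5   # evaluation policy: uniform over the two actions

def pi_b_observed(action, obs):
    return 0.5   # P(a | observation), marginalized over the hidden state

estimate = is_estimate(data, pi_e, pi_b_observed)
true_value = 0.5   # a uniform policy matches the hidden state half the time

print(f"IS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

In this toy setting every importance weight equals 1, so the estimator returns the behavior policy's average reward (about 0.9) rather than the evaluation policy's true value of 0.5. The gap comes entirely from the hidden state influencing both the logged actions and the rewards.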
