Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Miao Lu; Yifei Min; Zhaoran Wang; Zhuoran Yang

arXiv:2205.13589 · cs.LG · April 2, 2024

TL;DR

This paper introduces P3O, a novel offline RL algorithm for partially observable MDPs with confounded data, providing provable efficiency and addressing bias through proximal causal inference.

Contribution

The paper proposes P3O, the first provably efficient offline RL algorithm for POMDPs with confounded datasets, using proximal causal inference to handle bias and distributional shift.

Findings

01 Achieves $n^{-1/2}$-suboptimality under a partial coverage assumption, where $n$ is the number of trajectories in the dataset.

02 Addresses confounding bias in offline RL for POMDPs.

03 First provably efficient algorithm for this setting.
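
As a rough reading of the $n^{-1/2}$ rate, assume that suboptimality means the value gap between the optimal policy $\pi^\star$ and the learned policy $\widehat{\pi}$, with $n$ the dataset size. The symbols $J$, $\pi^\star$, $\widehat{\pi}$ and the constant $C$ below are notation introduced here for illustration, and constants and logarithmic factors are suppressed:

```latex
\[
  \mathrm{SubOpt}(\widehat{\pi}) \;=\; J(\pi^\star) - J(\widehat{\pi}) \;\le\; C\, n^{-1/2},
  \qquad
  \frac{C\,(4n)^{-1/2}}{C\, n^{-1/2}} \;=\; \frac{1}{2}.
\]
```

Under this reading, quadrupling the dataset size halves the bound on the gap.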

Abstract

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a…
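
The confounding issue described in the abstract can be illustrated with a toy one-step example. This is a minimal sketch, not the P3O algorithm: the latent-state prior, the behavior policy, and the reward function are all hypothetical numbers chosen for illustration. It compares the value of an action estimated naively from logged data, $E[R \mid A = a]$, with the value obtained when the action is actually set regardless of the latent state.

```python
# Hypothetical one-step model (not from the paper):
# latent state S, logged action A, reward R.
P_S = {0: 0.5, 1: 0.5}  # prior over the latent state

# Behavior policy depends on the latent state, which confounds the data.
P_A_GIVEN_S = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.1, 1: 0.9},
}


def reward(s, a):
    """Reward is driven by the latent state; action 1 costs 0.2."""
    return s - 0.2 * a


def naive_value(a):
    """E[R | A = a] under the logged data (conditions on the action)."""
    num = sum(P_S[s] * P_A_GIVEN_S[s][a] * reward(s, a) for s in P_S)
    den = sum(P_S[s] * P_A_GIVEN_S[s][a] for s in P_S)
    return num / den


def interventional_value(a):
    """E[R | do(A = a)]: the action is set independently of S."""
    return sum(P_S[s] * reward(s, a) for s in P_S)


if __name__ == "__main__":
    for a in (0, 1):
        print(a, round(naive_value(a), 3), round(interventional_value(a), 3))
```

In this toy model the naive estimate ranks action 1 above action 0, while the interventional values rank them the other way, because the logged action is correlated with the unobserved state. Correcting this bias from proxies of the latent state is the role the abstract assigns to proximal causal inference.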

Videos

Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Distributed Sensor Networks and Detection Algorithms · Machine Learning and Algorithms