Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes
Miao Lu, Yifei Min, Zhaoran Wang, Zhuoran Yang

TL;DR
This paper introduces P3O, an offline RL algorithm for partially observable MDPs (POMDPs) with confounded data. It comes with a provable efficiency guarantee and uses proximal causal inference to correct confounding bias.
Contribution
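The paper proposes P3O, the first provably efficient offline RL algorithm for POMDPs with confounded datasets. It uses proximal causal inference to correct confounding bias and pessimistic confidence regions to guard against the distributional shift between the optimal and behavior policies.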
Findings
Achieves $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset, under a partial coverage assumption on the confounded dataset.
Addresses confounding bias in offline RL for POMDPs.
First provably efficient algorithm for this setting.
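To make the first finding concrete, the following is a sketch of what an $n^{-1/2}$-suboptimality guarantee typically looks like. The symbols $J$, $\pi^*$, $\hat{\pi}$, and $C$ are notation assumed here for illustration and are not taken from this page; the paper's exact statement, constants, and any logarithmic factors may differ.

```latex
% Assumed notation: J(\pi) is the expected return of policy \pi,
% \pi^* is an optimal policy, \hat{\pi} is the policy learned from n trajectories,
% and C is a problem-dependent factor (e.g. horizon, coverage, function-class complexity).
\[
  \mathrm{SubOpt}(\hat{\pi}) \;=\; J(\pi^*) - J(\hat{\pi}) \;\le\; \frac{C}{\sqrt{n}}
\]
```

Read this way, quadrupling the number of trajectories would roughly halve the bound on the gap to the optimal policy, all else equal.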
Abstract
We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of P3O is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a…
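The pessimism principle mentioned in the abstract can be illustrated with a toy sketch. This is not the P3O algorithm: it omits proximal causal inference, the minimax estimation, and the coupled sequence of confidence regions. The function name `pessimistic_policy_choice`, the loss threshold, and the tabular inputs are all hypothetical, chosen only to show how a policy is picked by its worst-case value over a confidence region of plausible hypotheses.

```python
def pessimistic_policy_choice(value_estimates, losses, threshold):
    """Toy illustration of pessimism over a confidence region (not P3O itself).

    value_estimates[i][j]: value of candidate policy j under hypothesis i.
    losses[i]: empirical loss of hypothesis i on the offline data.
    threshold: hypothetical slack defining the confidence region.
    """
    # Confidence region: hypotheses whose loss is within `threshold` of the best.
    best_loss = min(losses)
    region = [
        values
        for values, loss in zip(value_estimates, losses)
        if loss <= best_loss + threshold
    ]

    # Pessimistic value of each policy: its worst case over the region.
    num_policies = len(region[0])
    pessimistic_values = [
        min(values[j] for values in region) for j in range(num_policies)
    ]

    # Select the policy with the largest pessimistic value.
    best_policy = max(range(num_policies), key=lambda j: pessimistic_values[j])
    return best_policy, pessimistic_values


if __name__ == "__main__":
    # Three hypotheses, two candidate policies (made-up numbers).
    values = [[1.0, 0.6], [0.2, 0.5], [5.0, 0.0]]
    losses = [0.10, 0.12, 0.90]
    print(pessimistic_policy_choice(values, losses, threshold=0.05))
```

In this made-up example the third hypothesis fits the data poorly and is excluded from the region, so its optimistic value for the first policy is ignored. The second policy is selected because its worst-case value over the remaining hypotheses is higher.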
Taxonomy
Topics: Reinforcement Learning in Robotics · Distributed Sensor Networks and Detection Algorithms · Machine Learning and Algorithms
