Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes
Zuyue Fu, Zhengling Qi, Zhaoran Wang, Zhuoran Yang, Yanxun Xu, Michael R. Kosorok

TL;DR
This paper addresses offline reinforcement learning in confounded Markov decision processes by using instrumental variables to identify value functions and develop policy learning methods with theoretical guarantees, demonstrated through simulations.
Contribution
It introduces a novel approach leveraging instrumental variables for confounded offline RL, providing identification results and policy learning algorithms with finite-sample guarantees.
Findings
The proposed methods effectively handle unmeasured confounders.
Theoretical guarantees ensure near-optimal policy learning under minimal data coverage.
Numerical study demonstrates promising real-world applicability.
Abstract
We study offline reinforcement learning (RL) in the face of unmeasured confounders. Due to the lack of online interaction with the environment, offline RL faces the following two significant challenges: (i) the agent may be confounded by unobserved state variables; (ii) the offline data collected a priori does not provide sufficient coverage of the environment. To tackle these challenges, we study policy learning in confounded MDPs with the aid of instrumental variables. Specifically, we first establish value function (VF)-based and marginalized importance sampling (MIS)-based identification results for the expected total reward in confounded MDPs. Then, by leveraging pessimism and our identification results, we propose various policy learning methods with a finite-sample suboptimality guarantee of finding the optimal in-class policy under minimal data…
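To make the role of an instrumental variable concrete, here is a minimal sketch in a static, linear, one-step setting. It is not the paper's VF-based or MIS-based identification for confounded MDPs; it only illustrates the general idea that an instrument can recover an effect that a naive regression gets wrong when a confounder is unobserved. The data-generating process, the coefficients, and the function names (simulate, ols_slope, iv_slope) are all assumptions made for illustration.

```python
import numpy as np

def simulate(n, true_effect=2.0, seed=0):
    """Hypothetical one-step confounded data: z -> a -> r, with u affecting a and r."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                      # unmeasured confounder (never observed)
    z = rng.normal(size=n)                      # instrument: moves a, independent of u
    a = z + u + rng.normal(size=n)              # action depends on instrument and confounder
    r = true_effect * a + 3.0 * u + rng.normal(size=n)  # reward depends on action and confounder
    return z, a, r

def ols_slope(x, y):
    """Naive regression slope of y on x; biased here because u drives both."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))

def iv_slope(z, a, r):
    """Ratio (Wald) IV estimator: cov(z, r) / cov(z, a)."""
    z_c = z - z.mean()
    return float(z_c @ (r - r.mean()) / (z_c @ (a - a.mean())))

z, a, r = simulate(200_000)
naive = ols_slope(a, r)   # absorbs the confounder's effect on the reward
iv = iv_slope(z, a, r)    # uses only the variation in a induced by z
print(f"naive OLS estimate: {naive:.2f}, IV estimate: {iv:.2f}, true effect: 2.00")
```

In this toy model the naive slope converges to roughly 3 rather than the true effect of 2, because the confounder raises both the action and the reward, while the IV ratio stays close to 2. The sequential setting in the paper requires considerably more machinery than this single ratio.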
Taxonomy
Topics: Age of Information Optimization
