Reinforcement Learning with Continuous Actions Under Unmeasured Confounding
Yuhan Li, Eugene Han, Yifan Hu, Wenzhuo Zhou, Zhengling Qi, Yifan Cui, Ruoqing Zhu

TL;DR
This paper develops a novel approach for offline reinforcement learning with continuous actions in the presence of unmeasured confounders, providing theoretical guarantees and practical algorithms for policy optimization.
Contribution
It introduces a new identification result and a minimax estimator for policy evaluation under unmeasured confounding in continuous action spaces, where most prior work assumes discrete actions.
Findings
The proposed estimator is consistent and has finite-sample error bounds.
The policy-gradient algorithm identifies the in-class optimal policy, with a regret bound on the resulting policy.
Empirical results demonstrate improved policy performance in simulations and real data.
Abstract
This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we advance this field by establishing a novel identification result to enable the nonparametric estimation of policy value for a given target policy under an infinite-horizon framework. Leveraging this identification, we develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy that maximizes the estimated policy value. Furthermore, we provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy. Extensive simulations and a real-world application…
