TL;DR
This paper develops a method for constructing confidence intervals for a target policy's value in offline reinforcement learning when unmeasured confounders are present, using auxiliary mediator variables to make the value identifiable.
Contribution
It introduces an off-policy value estimator for confounded Markov decision processes, relaxing the assumption made by most existing methods that no unmeasured variables confound the observed actions.
Findings
Method is robust to model misspecification
Provides rigorous uncertainty quantification
Validated on simulated and real ridesharing data
Abstract
This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data in infinite-horizon settings. Most existing works assume that no unmeasured variables confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technology industries. In this paper, we show that, with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results as well as simulated and real datasets obtained from ridesharing companies. A Python…
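To illustrate why a mediator can restore identifiability, here is a minimal single-step sketch in the spirit of the classical front-door adjustment. It is not the paper's estimator, which targets infinite-horizon confounded MDPs and comes with confidence intervals. All probabilities, variable names (`U`, `A`, `M`, `R`), and function names are hypothetical and chosen only for illustration. The sketch assumes a binary unmeasured confounder `U` that affects both the action `A` and the reward `R`, and a mediator `M` that depends on `A` alone.

```python
import numpy as np

# Hypothetical toy model (illustrative assumptions, not from the paper).
p_u = np.array([0.5, 0.5])                # P(U=u); U is never observed
p_a_given_u = np.array([[0.8, 0.2],       # P(A=a | U=0)
                        [0.2, 0.8]])      # P(A=a | U=1)
p_m_given_a = np.array([[0.7, 0.3],       # P(M=m | A=0)
                        [0.1, 0.9]])      # P(M=m | A=1)
mean_r = np.array([[0.0, 1.0],            # E[R | U=0, M=m]
                   [2.0, 3.0]])           # E[R | U=1, M=m]

# Full joint P(U=u, A=a, M=m), indexed as joint[u, a, m].
joint = p_u[:, None, None] * p_a_given_u[:, :, None] * p_m_given_a[None, :, :]

# Quantities an analyst could estimate from observed data (U summed out).
p_am = joint.sum(axis=0)                  # P(A=a, M=m)
p_a = p_am.sum(axis=1)                    # P(A=a)
p_m_a = p_am / p_a[:, None]               # P(M=m | A=a)
r_am = (joint * mean_r[:, None, :]).sum(axis=0) / p_am  # E[R | A=a, M=m]


def naive_value(a):
    """E[R | A=a]: ignores confounding, so it is biased in general."""
    return float(p_m_a[a] @ r_am[a])


def front_door_value(a):
    """sum_m P(m | a) * sum_a' P(a') * E[R | a', m], observed quantities only."""
    inner = p_a @ r_am                    # one entry per mediator value m
    return float(p_m_a[a] @ inner)


def true_value(a):
    """E[R | do(A=a)], computed with access to U (unavailable in practice)."""
    return float(p_u @ (mean_r @ p_m_given_a[a]))


for a in (0, 1):
    print(a, naive_value(a), front_door_value(a), true_value(a))
```

In this toy model the front-door value matches the interventional value for both actions, while the naive conditional mean does not, because `U` shifts both the chosen action and the reward.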
