TL;DR
This paper introduces two new methods for constructing valid confidence intervals in off-policy evaluation of reinforcement learning policies, effectively leveraging auxiliary data while providing reliable uncertainty quantification in high-stakes domains.
Contribution
It proposes a conformal prediction approach for high-dimensional state MDPs and a doubly robust inference method for average policy performance, enabling principled uncertainty estimates with augmented data.
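For readers unfamiliar with conformal prediction, the general idea can be sketched with textbook split conformal prediction. This is not the paper's CP-Gen procedure: the function name, the absolute-residual score, and the toy numbers below are all assumptions chosen for illustration, and the value predictions could come from any model, including one trained on augmented data.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Generic split conformal interval (illustrative, not CP-Gen).

    cal_preds / cal_targets: predicted and observed values on a held-out
    calibration set. Under exchangeability, the returned interval covers
    the test target with probability at least 1 - alpha.
    """
    scores = np.abs(np.asarray(cal_targets) - np.asarray(cal_preds))
    n = len(scores)
    # Rank of the conformal quantile with the finite-sample (n + 1) correction.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return -np.inf, np.inf
    q = np.sort(scores)[k - 1]
    return test_pred - q, test_pred + q

# Hypothetical calibration data: predictions of 0 against returns 1..9.
cal_preds = np.zeros(9)
cal_targets = np.arange(1.0, 10.0)
lower, upper = split_conformal_interval(cal_preds, cal_targets, test_pred=2.0, alpha=0.5)
print(lower, upper)
```

The width of the interval is driven entirely by the calibration residuals, so a biased predictor yields wider intervals rather than invalid ones, provided the calibration and test points are exchangeable.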
Findings
Both methods produce intervals that reliably cover the true policy values.
Both approaches outperform existing methods across diverse simulators and on healthcare data.
The methods are validated on the MIMIC-IV dataset and in multiple simulation environments.
Abstract
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state -- such intervals are particularly important for human-centered applications.…
Peer Reviews
Decision·Submitted to ICLR 2026
The problem of using synthetically generated data for OPE is an important one, especially at a time when interaction data with a decision process can be generated in abundance.
- Mistake in the proof of Equation (4): in the step from Equation (24) to Equation (25), the probability of a trajectory includes the transition probabilities along with the policy probabilities. In typical importance weighting, the transition function is held constant, so those terms cancel out. In the setup considered by this work, the synthetic data comes from a *different* generative model (with a correspondingly different transition function), so those terms cannot cancel out as is done in these steps.
- This co
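The cancellation argument in this review can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state, two-action tabular MDP. The arrays `pi_b`, `pi_e`, `P_true`, and `P_gen` are invented for illustration and are not taken from the paper. It compares the trajectory log-likelihood ratio when both trajectories share one transition function with the ratio when the data comes from a different generative model.

```python
import numpy as np

def trajectory_log_prob(states, actions, policy, transition):
    """Log-probability of a trajectory, excluding the initial-state term.

    policy[s, a]          = pi(a | s)
    transition[s, a, s2]  = P(s2 | s, a)
    """
    logp = 0.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(policy[s, a]) + np.log(transition[s, a, s_next])
    return logp

def policy_only_log_ratio(states, actions, pi_e, pi_b):
    """Usual importance weight: product of policy ratios, in log space."""
    return sum(np.log(pi_e[s, a]) - np.log(pi_b[s, a])
               for s, a in zip(states, actions))

# Hypothetical policies and transition models (illustrative values only).
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_e = np.array([[0.9, 0.1], [0.2, 0.8]])
P_true = np.array([[[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]]])
P_gen = np.array([[[0.6, 0.4], [0.4, 0.6]], [[0.5, 0.5], [0.3, 0.7]]])

states, actions = [0, 1, 1, 0], [0, 1, 0]

policy_ratio = policy_only_log_ratio(states, actions, pi_e, pi_b)
# Same dynamics in numerator and denominator: transition terms cancel.
same_dyn = (trajectory_log_prob(states, actions, pi_e, P_true)
            - trajectory_log_prob(states, actions, pi_b, P_true))
# Data generated under P_gen: a transition-ratio term is left over.
diff_dyn = (trajectory_log_prob(states, actions, pi_e, P_true)
            - trajectory_log_prob(states, actions, pi_b, P_gen))

print(np.isclose(same_dyn, policy_ratio))  # transition terms cancel
print(diff_dyn - policy_ratio)             # leftover sum of log(P_true / P_gen)
```

When the dynamics match, the full ratio equals the policy-only ratio. When they differ, the gap is the summed log ratio of the two transition models along the trajectory, which is the term the reviewer argues cannot be dropped.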
This paper is well motivated and provides novel confidence intervals for OPE (CP-Gen) using an initial starting state and leverages common modern techniques for generative modeling. This is well motivated by applications in healthcare where each decision point at similar states leads to similar behavior. Additionally, the authors relax this assumption in the DR-PPI method for confidence interval calculation by removing the conditioning on starting state. The authors provide theoretical results
The main weaknesses in this paper surround the presentation of the methods and results. I struggled to understand the paper fully due to unnamed variables, shifts in notation, and general lack of clarity / assumptions / givens.
### Lack of clarity
1. What is the purpose of having a discount and a finite horizon?
2. Line 97 introduces IPS with $\pi(s,a)$ but $\pi(a | s)$ is used throughout the rest of the paper. Is there a meaningful difference?
3. Do you have access to the evaluation policy i
1. The paper studies a significant problem in OPE. Providing reliable CIs for policies with biased data is important and meaningful in many real-world domains.
2. The paper provides solid theoretical foundations for its methods under a few assumptions.
3. The empirical comparisons with baseline approaches across multiple domains show the effectiveness of the proposed approaches.
1. Three assumptions in the theory might be strong and the authors do not seem to provide convincing arguments. Specifically, I'm not sure how the fact that policy probabilities lie in [0,1] guarantees Assumption 1 (Line 335). The authors also acknowledge that Assumption 3 is a strong assumption.
2. The CP-Gen algorithm introduces two hyperparameters $\epsilon_s$ and $\epsilon_r$, which are not standard in OPE. The authors do not provide a principled selection rule for these hyperparameters.
