Quantile-Optimal Policy Learning under Unmeasured Confounding

Zhongren Chen; Siyu Chen; Zhengling Qi; Xiaohong Chen; Zhuoran Yang

arXiv:2506.07140·stat.ML·June 10, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces novel causal-assisted methods for offline quantile-optimal policy learning in the presence of unmeasured confounding, providing strong theoretical guarantees and addressing key challenges like nonlinearity and limited data coverage.

Contribution

It develops the first sample-efficient algorithms for quantile-optimal policy learning under unmeasured confounding, using instrumental variables, negative controls, and minimax estimation.

Findings

01

Achieves $\tilde{O}(n^{-1/2})$ quantile-optimality under mild coverage conditions

02

Proposes a regularized, computationally friendly policy learning method

03

Provides theoretical guarantees for policy performance in confounded offline settings

Abstract

We study quantile-optimal policy learning, where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. We focus on the offline setting in which the data-generating process involves unobserved confounders. This problem poses three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. Then we adopt a minimax…
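To make the quantile objective concrete, here is a minimal sketch that compares two policies by the empirical $\alpha$-quantile of their rewards rather than by the mean. It is an illustration only, not the paper's estimator: it assumes reward samples for each policy are directly available (no confounding, no coverage issues), and the names `empirical_quantile`, `policy_a`, and `policy_b`, as well as the reward values, are hypothetical.

```python
import math
from statistics import mean


def empirical_quantile(rewards, alpha):
    """Return the smallest sample value r whose empirical CDF is >= alpha."""
    sorted_rewards = sorted(rewards)
    # Index of the ceil(alpha * n)-th order statistic (0-based), clamped at 0.
    k = max(math.ceil(alpha * len(sorted_rewards)) - 1, 0)
    return sorted_rewards[k]


# Hypothetical reward samples: policy_a has a high mean but a heavy lower tail,
# policy_b has a lower mean but its rewards are tightly concentrated.
rewards = {
    "policy_a": [0.0, 0.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0],
    "policy_b": [4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0],
}

alpha = 0.25
quantiles = {name: empirical_quantile(r, alpha) for name, r in rewards.items()}
means = {name: mean(r) for name, r in rewards.items()}

# The two criteria can select different policies.
best_by_quantile = max(quantiles, key=quantiles.get)
best_by_mean = max(means, key=means.get)

print("alpha-quantiles:", quantiles)
print("means:", means)
print("best by quantile:", best_by_quantile)
print("best by mean:", best_by_mean)
```

On this toy data the mean criterion prefers `policy_a` while the 0.25-quantile criterion prefers `policy_b`, which illustrates why the quantile objective cannot be reduced to an expectation of the reward. In the confounded offline setting of the abstract, these per-policy reward samples are not observed directly, which is where the instrumental-variable and negative-control estimators come in.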

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. I think the authors study an important problem, as the issue of unmeasured confounding / partial observability is quite common in offline reinforcement learning. The proposed method also combines causal inference techniques with pessimism-based policy optimization methods from offline reinforcement learning.
2. The theoretical results are interesting and provide statistical rates under the two settings -- (a) instrumental variables, and (b) negative controls. However, the results are proven u

Weaknesses

1. Since the paper deals with causal inference with unmeasured confounding, it should include experiments on real-world datasets. Handling real-world datasets introduces many challenges, e.g., discrete outcomes, missing covariates, etc. The current experimental setup is a simulation-based setting with linear functionals and a one-dimensional context, which is too simple to demonstrate the effectiveness of the proposed method.
2. The authors make several strong assumptions in order to pro

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- This new setting sounds interesting, and the authors did a reasonably good job of motivating it (though also see the weakness on this point).
- The writing quality is in general quite good, except for the technical part (also see the weakness on this point).

Weaknesses

- The authors motivated the unmeasured confounders with a healthcare example, and quantile optimization with job training programs. However, these are distinct cases, and I wonder whether there is one *unified* motivating example that requires both.
- Section 4, while highly technical, is the most important part of the paper, as the main contribution of the paper claims novel methodologies. Such a section must be written with extra care for clarity even though it conveys complex ideas. As it s

Reviewer 03 · Rating 5 · Confidence 3

Strengths

It is interesting to see how the tricks from the theory of offline RL also apply in this setting.

Weaknesses

**Readability of this paper.** The paper is difficult to read.
1. The notations are horrible. The letter W is both the NCO variable and a function of (D,h). $\mathcal{O}$ stands for both the observation space and big-O notation.
2. The bounds hide most of the dependency on the parameters, which makes it difficult to see how the assumptions influence the results. For example, the dependency hidden in $o_p(\cdot)$ is never properly specified.
3. The assumptions are stated messily and lack sufficient explanation. For example

Reviewer 04 · Rating 5 · Confidence 3

Strengths

1. This work considers reward distribution and unobserved confounding, a challenging yet highly practical setting, especially as unobserved confounding is prevalent in real applications.
2. The algorithms leverage the Pessimism Principle within minimax estimation to address issues arising from insufficient sample size. Furthermore, the authors introduce a regularized version to handle computational difficulty.
3. Theoretical results are substantial. The authors discuss both IV and NC scenarios i

Weaknesses

1. The motivation for studying quantiles in policy learning is not clear. The authors use an example of income in a job training program to illustrate the relevance of analyzing reward distribution, which has merit but remains insufficient. In particular, the related works section lacks any mention of quantile policy learning literature. Although work in this area is limited, introducing relevant studies on quantile treatment regimes and quantile treatment effects would help clarify the signific

Taxonomy

Topics: Advanced Causal Inference Techniques · Statistical Methods and Inference · Advanced Bandit Algorithms Research

Methods: ADaptive gradient method with the OPTimal convergence rate · Focus · Causal inference