Efficient and Sharp Off-Policy Learning under Unobserved Confounding
Konstantin Hess, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel

TL;DR
This paper introduces a new method for personalized off-policy learning that accounts for unobserved confounding, providing a sharp, efficient estimator that improves policy robustness in sensitive decision-making domains.
Contribution
It develops a semi-parametrically efficient estimator for sharp bounds on the value function under unobserved confounding, avoiding unstable optimization and enabling confounding-robust policy improvement.
Findings
Outperforms existing baselines in synthetic and real data experiments
Provides a stable, efficient estimator avoiding inverse propensity weighting
Enables robust policy improvement under unobserved confounding
Abstract
We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, because of which standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a semi-parametrically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is semi-parametrically efficient. (3) We prove that our…
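To make "sharp bound under unobserved confounding" concrete, here is a minimal sketch. It is not the paper's efficient estimator. It assumes the marginal sensitivity model (MSM) in its common form, where the odds ratio between the true and the nominal propensity lies in $[1/\Gamma, \Gamma]$. It uses the quantile-based closed form known from the sensitivity-analysis literature, applied as a naive plug-in within a single covariate stratum. The function name `msm_sharp_bounds` and its arguments are hypothetical and chosen for illustration only.

```python
import numpy as np


def msm_sharp_bounds(y, propensity, gamma):
    """Plug-in sharp bounds on E[Y(a) | X = x] under an MSM with parameter gamma.

    y          : observed outcomes of units with X = x that received action a
    propensity : nominal propensity P(A = a | X = x), strictly between 0 and 1
    gamma      : sensitivity parameter, gamma >= 1 (gamma = 1 means no confounding)
    """
    y = np.asarray(y, dtype=float)
    mean_y = y.mean()

    def bound(tau, w_low, w_high):
        # Outcomes at or below the tau-quantile get weight w_low, the rest w_high.
        q = np.quantile(y, tau)
        low = np.mean(y * (y <= q))
        high = np.mean(y * (y > q))
        return propensity * mean_y + (1.0 - propensity) * (w_low * low + w_high * high)

    # Upper bound: up-weight the top 1/(1+gamma) share of outcomes.
    upper = bound(gamma / (1.0 + gamma), 1.0 / gamma, gamma)
    # Lower bound: up-weight the bottom 1/(1+gamma) share of outcomes.
    lower = bound(1.0 / (1.0 + gamma), gamma, 1.0 / gamma)
    return lower, upper


if __name__ == "__main__":
    y = np.arange(1, 101, dtype=float)  # toy outcomes 1, 2, ..., 100
    print(msm_sharp_bounds(y, propensity=0.5, gamma=1.0))
    print(msm_sharp_bounds(y, propensity=0.5, gamma=3.0))
```

With `gamma=1.0` both bounds collapse to the sample mean, which is the unconfounded case; larger values of `gamma` widen the interval. A plug-in of this kind inherits the error of the estimated quantile and propensity, which is the sort of bias a one-step correction is meant to remove.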
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is clearly written and easy to understand.
- The main contribution of this paper is a semiparametrically efficient estimator for the offline robust policy learning problem, arguing that the approach of Kallus and Zhou 2020 may be unstable due to the dependence on inverse propensity weights. Instability of inverse propensity weights is a known problem that can lead to instability of estimators.
- They propose a naive plug-in estimator for the optimal robust policy but note that it will su
- The problem of robust offline policy learning under the marginal sensitivity model and Rosenbaum selection model is quite well-studied, e.g. Aronow and Lee 2013, Miratrix et al. 2018, Zhao et al. 2019, Yadlowsky et al. 2018, Kallus et al. 2018, Kallus and Zhou 2020. Furthermore, other works such as Bruns-Smith and Zhou, 2023 consider dynamic policy learning. So, the problem that the authors aim to solve has limited novelty. Nevertheless, this paper does cite and reference many of the relevant
Policy learning under unmeasured confounding is an important problem with broad applications. The authors identify a key gap in the literature and address it via appropriate methods. Presentation of the work is effective: I especially appreciate Figure 2 illustrating how the concept of sharpness connects to regret. The theoretical results, including identification bounds (4.1), the bias-corrected estimator (4.2), and learning guarantees (4.3), are also well suited to this problem setting. The synthe
## Connection to prior work & significance
In general, the authors provide solid coverage of prior work and appropriately situate the contribution in the literature. However, this work can be viewed as a targeted improvement on top of the basic framework established in Kallus & Zhou (2018a; 2021). While I still believe such work is valuable and worthy of publication, this somewhat limits the significance of the results. More specifically, it would be helpful if the authors could provide more d
- Closed-form sharp bound for value under MSM
- One-step bias-corrected estimator hits the efficiency bound
- Learning guarantees to the optimal confounding-robust policy
- The EIF and one-step estimator rely on quantiles $F_{x,a}^{-1}(\alpha_{\pm})$. You do not state standard conditions ensuring pathwise differentiability.
- You claim the estimator “is semi-parametrically efficient” and point to D.2, which provides an influence function expression and cites a chain-rule lemma. But you never identify the canonical gradient in the nonparametric model nor verify that your influence function equals it.
- Theorem 4.4 needs a uniform bound, but the current version is poi
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Smart Grid Energy Management
