Efficient and Sharp Off-Policy Learning under Unobserved Confounding

Konstantin Hess; Dennis Frauen; Valentyn Melnychuk; Stefan Feuerriegel

arXiv:2502.13022·cs.LG·February 18, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a new method for personalized off-policy learning that accounts for unobserved confounding, providing a sharp, efficient estimator that improves policy robustness in sensitive decision-making domains.

Contribution

It develops a semi-parametrically efficient estimator for sharp bounds on the value function under unobserved confounding, avoiding unstable optimization and enabling confounding-robust policy improvement.

Findings

01

Outperforms existing baselines in synthetic and real data experiments

02

Provides a stable, efficient estimator avoiding inverse propensity weighting

03

Enables robust policy improvement under unobserved confounding

Abstract

We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, because of which standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a semi-parametrically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is semi-parametrically efficient. (3) We prove that our…
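As a rough illustration of why estimators built on inverse propensity weighted outcomes can be unstable, the sketch below simulates a plain IPW value estimate and measures how its spread across repeated samples grows when propensities are allowed to approach zero. This is not the paper's estimator and it contains no unobserved confounding; the function names, the data-generating process, and all parameter values are hypothetical choices made for this example.

```python
import numpy as np


def ipw_value_estimate(y, a, propensity, policy_action):
    """IPW estimate of the value of a policy that always picks `policy_action`.

    `propensity` holds the logging probability of `policy_action` for each unit.
    """
    # Units whose logged action matches the policy get weight 1/propensity,
    # all other units get weight 0.
    match = (a == policy_action).astype(float)
    weights = match / propensity
    return float(np.mean(weights * y))


def simulate_ipw_spread(min_propensity, n=2000, reps=200, seed=0):
    """Standard deviation of the IPW estimate over repeated synthetic samples.

    Hypothetical setup: P(A=1 | X) ranges over [min_propensity, 1 - min_propensity]
    and the outcome has mean 1 regardless of the action, so the true value is 1.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        e = min_propensity + (1.0 - 2.0 * min_propensity) * x
        a = rng.binomial(1, e)
        y = 1.0 + rng.normal(0.0, 1.0, n)
        estimates.append(ipw_value_estimate(y, a, e, policy_action=1))
    return float(np.std(estimates))


if __name__ == "__main__":
    print("spread with propensities in [0.2, 0.8]:    ", simulate_ipw_spread(0.2))
    print("spread with propensities in [0.001, 0.999]:", simulate_ipw_spread(0.001))
```

Both settings target the same true value, so any difference in spread comes only from the size of the weights, which is the kind of instability the abstract attributes to approaches based on inverse propensity weighted outcomes.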

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper is clearly written and easy to understand.
- The main contribution of this paper is a semiparametrically efficient estimator for the offline robust policy learning problem, arguing that the approach of Kallus and Zhou 2020 may be unstable due to its dependence on inverse propensity weights. Instability of inverse propensity weights is a known problem that can lead to instability of estimators.
- They propose a naive plug-in estimator for the optimal robust policy but note that it will su

Weaknesses

- The problem of robust offline policy learning under the marginal sensitivity model and Rosenbaum selection model is quite well-studied, e.g. Aronow and Lee 2013, Miratrix et al. 2018, Zhao et al. 2019, Yadlowsky et al. 2018, Kallus et al. 2018, Kallus and Zhou 2020. Furthermore, other works such as Bruns-Smith and Zhou, 2023 consider dynamic policy learning. So, the problem that the authors aim to solve has limited novelty. Nevertheless, this paper does cite and reference many of the relevant

Reviewer 02 · Rating 4 · Confidence 3

Strengths

Policy learning under unmeasured confounding is an important problem with broad applications. The authors identify a key gap in the literature and address it via appropriate methods. Presentation of the work is effective: I especially appreciate Figure 2 illustrating how the concept of sharpness connects to regret. The theoretical results, including identification bounds (4.1), a bias-corrected estimator (4.2), and learning guarantees (4.3), are also well suited to this problem setting. The synthe

Weaknesses

## Connection to prior work & significance

In general, the authors provide solid coverage of prior work and appropriately situate the contribution in the literature. However, this work can be viewed as a targeted improvement on top of the basic framework established in Kallus & Zhou (2018a; 2021). While I still believe such work is valuable and worthy of publication, this somewhat limits the significance of the results. More specifically, it would be helpful if the authors could provide more d

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Closed-form sharp bound for value under MSM
- One-step bias-corrected estimator hits the efficiency bound
- Learning guarantees to the optimal confounding-robust policy
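For context, the marginal sensitivity model (MSM) named in the strengths above is commonly stated as follows for a binary treatment. This is the standard formulation from the sensitivity-analysis literature; the notation ($\pi$, $U$, $\Gamma$) is an assumption here, since the paper's own definition is not reproduced on this page.

$$
\frac{1}{\Gamma} \;\le\; \frac{\pi(x)}{1-\pi(x)} \cdot \frac{1-\pi(x,u)}{\pi(x,u)} \;\le\; \Gamma, \qquad \Gamma \ge 1,
$$

where $\pi(x) = P(A=1 \mid X=x)$ is the observed propensity, $\pi(x,u) = P(A=1 \mid X=x, U=u)$ is the propensity given the unobserved confounder $U$, and $\Gamma = 1$ recovers unconfoundedness.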

Weaknesses

- The EIF and one-step estimator rely on quantiles $F_{x,a}^{-1}(\alpha_{\pm})$. You do not state standard conditions ensuring pathwise differentiability.
- You claim the estimator “is semi-parametrically efficient” and point to D.2, which provides an influence function expression and cites a chain-rule lemma. But you never identify the canonical gradient in the nonparametric model nor verify that your influence function equals it.
- Theorem 4.4 needs a uniform bound, but the current version is poi

Taxonomy

Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Smart Grid Energy Management