Causal isotonic calibration for heterogeneous treatment effects

Lars van der Laan, Ernesto Ulloa-Pérez, Marco Carone, and Alex Luedtke

arXiv:2302.14011·stat.ML·June 7, 2023

TL;DR

This paper introduces causal isotonic calibration, a nonparametric method for calibrating predictors of heterogeneous treatment effects, and cross-calibration, a data-efficient variant that uses cross-fitted predictors and all available data, removing the need for a hold-out calibration set.

Contribution

It presents a calibration method that can be wrapped around any black-box predictor and that provides distribution-free calibration guarantees under weak conditions while preserving predictive performance.

Findings

01 Achieves fast doubly-robust calibration rates

02 Works without assuming monotonicity

03 Can be wrapped around any black-box model

Abstract

We propose causal isotonic calibration, a novel nonparametric method for calibrating predictors of heterogeneous treatment effects. Furthermore, we introduce cross-calibration, a data-efficient variant of calibration that eliminates the need for hold-out calibration sets. Cross-calibration leverages cross-fitted predictors and generates a single calibrated predictor using all available data. Under weak conditions that do not assume monotonicity, we establish that both causal isotonic calibration and cross-calibration achieve fast doubly-robust calibration rates, as long as either the propensity score or outcome regression is estimated accurately in a suitable sense. The proposed causal isotonic calibrator can be wrapped around any black-box learning algorithm, providing robust and distribution-free calibration guarantees while preserving predictive performance.

Tables

Table 1: Information on the set of estimators used by the Super Learner to estimate the pseudo-outcome components. Abbreviations: generalized additive models (GAM), generalized linear model (GLM), generalized linear model with lasso regularization (GLMnet), gradient boosted trees (GBRT), random forests (RF), multivariate adaptive regression splines (MARS).

scenario	library for $\mu_{0}$	library for $\pi_{0}$
1	logistic regression, GLMnet, GAM, GBRT with depth $\in\{2, 3, 5, 6, 8\}$, RF, MARS	logistic regression, GLMnet, GAM, GBRT with depth $\in\{2, 4, 6\}$
2	GLMnet	GLMnet

Table 2: Scenario 1 bias within bins of predictions for the calibrated and uncalibrated estimators. Each row shows the resulting bias for a given CATE estimator, and the Cal column indicates if it is calibrated or not. The columns are organized by sample size, and within each sample size, we show the results for the bias in the upper and lower deciles. Abbreviations: calibrated (cal), estimator (est), generalized additive models (GAM), generalized linear model (GLM), generalized linear model with lasso regularization (GLMnet), gradient boosted regression trees (GBRT), random forests (RF), multivariate adaptive regression splines (MARS).

Rows with fewer than eight values (both GAM rows, uncalibrated GLMnet, calibrated RF, calibrated GBRT 2) list their values in the order extracted; the columns of their missing entries cannot be determined.

Cal	CATE estimator	n=1000, lower decile	n=1000, upper decile	n=2000, lower decile	n=2000, upper decile	n=5000, lower decile	n=5000, upper decile	n=10000, lower decile	n=10000, upper decile
yes	MARS	-0.01	-0.02	0.01	-0.01	0	-0.02	0	-0.01
no	MARS	-0.23	0.23	-0.13	0.14	-0.06	0.06	-0.02	0.03
yes	GAM	-0.04	0.02	-0.01	0.03	0	0.01	0
no	GAM	-0.08	0.04	-0.04	0.01	-0.02	0	-0.01
yes	GLM	-0.05	0.04	-0.02	0.03	-0.02	0.02	-0.01	0.02
no	GLM	-0.02	0.05	0.02	0.03	0.02	0.01	0.03	0.02
yes	GLMnet	-0.05	0.04	-0.02	0.02	-0.02	0.02	-0.01	0.02
no	GLMnet	0	0.03	0.02	0.03	0.01	0.03	0.01
yes	RF	-0.06	0.03	-0.01	0.02	-0.01	0
no	RF	-0.34	0.34	-0.3	0.31	-0.28	0.27	-0.24	0.25
yes	GBRT 2	-0.03	0	-0.01	0
no	GBRT 2	-0.15	0.14	-0.05	0.05	-0.01	-0.03	0.01	-0.04
yes	GBRT 5	-0.01	-0.03	0.03	-0.06	0.03	-0.06	0.03	-0.05
no	GBRT 5	-0.49	0.51	-0.34	0.37	-0.19	0.2	-0.1	0.12
yes	GBRT 8	-0.02	-0.03	0.02	-0.06	0.05	-0.09	0.05	-0.09
no	GBRT 8	-0.67	0.74	-0.54	0.6	-0.39	0.42	-0.27	0.32

Table 3: Scenario 2 bias within bins of predictions for the calibrated and uncalibrated estimators. Each row shows the resulting bias for a given CATE estimator, and the Cal column indicates if it is calibrated or not. The columns are organized by sample size, and within each sample size, we show the results for the bias in the upper and lower deciles. Abbreviations: calibrated (cal), generalized linear model with lasso regularization (GLMnet), gradient boosted regression trees with GLMnet screening (GLMnet scr + GBRT).

Cal	CATE estimator	n=1000, lower decile	n=1000, upper decile	n=2000, lower decile	n=2000, upper decile	n=5000, lower decile	n=5000, upper decile	n=10000, lower decile	n=10000, upper decile
yes	GLMnet	-0.01	0.01	-0.04	-0.01	-0.04	-0.01	-0.03	-0.01
no	GLMnet	-0.11	-0.07	-0.11	-0.06	-0.08	-0.04	-0.07	-0.03
yes	GLMnet scr + GBRT	-0.11	-0.08	-0.12	-0.08	-0.12	-0.07	-0.1	-0.06
no	GLMnet scr + GBRT	0.09	0.03	0.05	0.01	0.04	0.01	0.03	0
yes	random forest	-0.03	-0.01	-0.03	-0.02	-0.04	-0.02	-0.03	-0.02
no	random forest	-0.8	-0.41	-0.72	-0.38	-0.62	-0.33	-0.54	-0.29

Equations

$$E\{[\tau_{0}(W)-\gamma_{0}(\tau,W)]\,I(\gamma_{0}(\tau,W)\in[a,b))\}=0.$$

$$\text{CAL}(\tau):=\int[\gamma_{0}(\tau,w)-\tau(w)]^{2}\,dP_{W}(w).$$

$$\text{MSE}(\tau):=\|\tau_{0}-\tau\|^{2}=\text{CAL}(\tau)+\text{DIS}(\tau),$$

$$\text{logit}\,P(Y=1\mid\tau(W)=t)=\alpha+\beta t$$

$$\chi_{0}:o\mapsto\tau_{0}(w)+\frac{a-\pi_{0}(w)}{\pi_{0}(w)[1-\pi_{0}(w)]}[y-\mu_{0}(a,w)],$$

$$\theta\mapsto\frac{1}{n}\sum_{i=1}^{n}[\chi_{n}(O_{i})-\theta\circ\tau(W_{i})]^{2}.$$

$$\theta_{n}^{*}=\operatorname*{argmin}_{\theta\in\mathcal{F}_{iso}}\sum_{i\in\mathcal{I}_{\ell}}[\chi_{m}(O_{i})-\theta\circ\tau(W_{i})]^{2}$$

$$\theta_{n}^{*}=\operatorname*{argmin}_{\theta\in\mathcal{F}_{iso}}\sum_{i=1}^{n}[\chi_{n,j(i)}(O_{i})-(\theta\circ\tau_{n,j(i)})(W_{i})]^{2};$$

$$\text{CAL}(\tau_{n}^{*})=O_{P}\left(\ell^{-2/3}+\|(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\|^{2}\right).$$

$$\text{CAL}(\tau_{n}^{*})\leq k\sum_{s=1}^{k}\text{CAL}(\tau_{n,s}^{*}),$$

$$O_{P}\left(n^{-2/3}+\max_{1\leq s\leq k}\|(\pi_{n,s}-\pi_{0})(\mu_{n,s}-\mu_{0})\|^{2}\right)$$

$$\tau_{0}^{*}:=\operatorname*{argmin}_{\theta\circ\tau:\,\theta\in\mathcal{F}_{iso}}\|\tau_{0}-\theta\circ\tau\|.$$

$$\|\tau_{n}^{*}-\tau_{0}^{*}\|=O_{P}\left(\ell^{-1/3}+\|(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\|\right)$$

$$\text{MSE}(\tau_{n}^{*})-\text{MSE}(\tau)\leq O_{P}\left(\ell^{-1/3}+\|(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\|\right).$$

$$\theta_{n}^{*}=\operatorname*{argmin}_{\theta\in\mathcal{F}_{iso}}\frac{1}{n}\sum_{i=1}^{n}[\chi_{n,j(i)}(O_{i})-(\theta\circ\tau)(W_{i})]^{2};$$

$$B:=\sup_{m\in\mathbb{N}}\sup_{\mathcal{E}_{m}}\sup_{o\in\mathcal{O}}[\,|\chi_{0}(o)|+|\chi_{m}(o)|\,],$$

$$J(\delta,\mathcal{F}):=\int_{0}^{\delta}\sup_{Q}\log N(\epsilon,\mathcal{F},L_{2}(Q))\,d\epsilon,$$

$$\sum_{i\in\mathcal{I}_{\ell}}[r\circ\tau_{n}^{*}(W_{i})][\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]=0.$$

$$\frac{d}{d\varepsilon}[R_{n}(\xi_{n}(\varepsilon))-R_{n}(\theta_{n}^{*})]\Big|_{\varepsilon=0}\geq 0\quad\text{and}\quad\frac{d}{d\varepsilon}[R_{n}(\xi_{n}(-\varepsilon))-R_{n}(\theta_{n}^{*})]\Big|_{\varepsilon=0}\leq 0.$$

$$2\sum_{i\in\mathcal{I}_{\ell}}1(\tau(W_{i})\geq\bar{u}_{j})[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]\geq 0\quad\text{and}\quad 2\sum_{i\in\mathcal{I}_{\ell}}1(\tau(W_{i})\geq\bar{u}_{j})[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]\leq 0,$$

$$\sum_{i\in\mathcal{I}_{\ell}}s(W_{i})[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]=0$$

$$\sum_{i\in\mathcal{I}_{\ell}}r\circ\tau_{n}^{*}(W_{i})[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]=0.$$

$$E[Y_{1}-Y_{0}\mid(\theta_{n}^{*}\circ\tau)(W)=\tau^{\prime}]=E[Y_{1}-Y_{0}\mid\tau(W)\in B_{\tau^{\prime}}]$$

$$E[Y_{1}-Y_{0}\mid\tau(W)\in B_{\tau^{\prime}}]=E[\theta_{0}\circ\tau(W)\mid\tau(W)\in B_{\tau^{\prime}}],$$

$$E[\theta_{0}\circ\tau(W)\mid\tau(W)\in B_{\tau^{\prime}}]=E[\theta_{0}^{+}\circ\tau(W)\mid\tau(W)\in B_{\tau^{\prime}}]-E[\theta_{0}^{-}\circ\tau(W)\mid\tau(W)\in B_{\tau^{\prime}}],$$

$$\operatorname*{ess\,inf}_{w\in\mathcal{W}}(\theta_{0}^{+}\circ\tau)(w)\leq E[\theta_{0}^{+}\circ\tau(W)\mid\tau(W)\in B_{\tau^{\prime}}]\leq\operatorname*{ess\,sup}_{w\in\mathcal{W}}(\theta_{0}^{+}\circ\tau)(w),$$

$$\begin{aligned}
E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)][\chi_{0}(O)-\tau_{n}^{*}(W)]\mid\mathcal{D}_{n}\}
&=E\{E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)][\chi_{0}(O)-\tau_{n}^{*}(W)]\mid W\}\mid\mathcal{D}_{n}\}\\
&=E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)][\tau_{0}(W)-\tau_{n}^{*}(W)]\mid\mathcal{D}_{n}\}\\
&=E\{E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)][\tau_{0}(W)-\tau_{n}^{*}(W)]\mid\tau_{n}^{*}(W)\}\mid\mathcal{D}_{n}\}\\
&=E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)][\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)]\mid\mathcal{D}_{n}\}\\
&=E\{[\gamma_{0}(\tau_{n}^{*},W)-\tau_{n}^{*}(W)]^{2}\mid\mathcal{D}_{n}\}.
\end{aligned}$$

$$\int\left\{\gamma_{0}(\tau_{n}^{*},w)-\tau_{n}^{*}(w)\right\}^{2}dP(w)$$

Code & Models

Repositories

larsvanderlaan/causalcalibration
Official

Taxonomy

Topics: Advanced Causal Inference Techniques · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Full text

Causal isotonic calibration for heterogeneous treatment effects

Lars van der Laan

Department of Statistics, University of Washington, USA

Ernesto Ulloa-Pérez

Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, USA

Marco Carone

Department of Biostatistics, University of Washington, USA

Department of Statistics, University of Washington, USA

Alex Luedtke

Department of Statistics, University of Washington, USA

Department of Biostatistics, University of Washington, USA

(Version 2: June 5, 2023)

* These authors contributed equally to this work.

1 Introduction

Estimation of causal effects via both randomized experiments and observational studies is critical to understanding the effects of interventions and informing policy. Moreover, it is often the case that understanding treatment effect heterogeneity can provide more insights than overall population effects (Obermeyer and Emanuel, 2016; Athey, 2017). For instance, a study of treatment effect heterogeneity can help elucidate the mechanism of an intervention, design policies targeted to subpopulations that can most benefit (Imbens and Wooldridge, 2009), and predict the effect of interventions in populations other than the ones in which they were developed. These necessities have arisen in a wide range of fields, such as marketing (Devriendt et al., 2018), the social sciences (Imbens and Wooldridge, 2009), and the health sciences (Kent et al., 2018). For example, in the health sciences, heterogeneous treatment effects (HTEs) are of high importance to understanding and quantifying how certain exposures or interventions affect the health of various subpopulations (Dahabreh et al., 2016; Lee et al., 2020). Potential applications include prioritizing treatment to certain sub-populations when treatment resources are scarce, or individualizing treatment assignments when the treatment can have no effect (or even be harmful) in certain subpopulations (Dahabreh et al., 2016). As an example, treatment assignment based on risk scores has been used to provide clinical guidance in cardiovascular disease prevention (Lloyd-Jones et al., 2019) and to improve decision-making in oncology (Collins and Varmus, 2015; Cucchiara et al., 2018).

A wide range of statistical methods are available for assessing HTEs, with recent examples including Wager and Athey (2018), Carnegie et al. (2019), Lee et al. (2020), Yadlowsky et al. (2021), and Nie and Wager (2021), among others. In particular, many methods, including Imbens and Wooldridge (2009) and Dominici et al. (2020), scrutinize HTEs via conditional average treatment effects (CATEs). The CATE is the difference in the conditional mean of the counterfactual outcome corresponding to treatment versus control given covariates, which can be defined at a group or individual level. When interest lies in predicting treatment effect, the CATE can be viewed as the oracle predictor of the individual treatment effect (ITE) that can feasibly be learned from data. Optimal treatment rules have been derived based on the sign of the CATE estimator (Murphy, 2003; Robins, 2004), with more recent works incorporating the use of flexible CATE estimators (Luedtke and van der Laan, 2016). Thus, due to its wide applicability and scientific relevance, CATE estimation has been of great interest in statistics and data science.

Regardless of its quality as a proxy for the true CATE, it is generally accepted that predictions from a given treatment effect predictor can still be useful for decision-making. However, theoretical guarantees for rational decision-making using a given treatment effect predictor typically hinge on the predictor being a good approximation of the true CATE. Accurate CATE estimation can be challenging because the nuisance parameters involved can be non-smooth, high-dimensional, or otherwise difficult to model correctly. Additionally, a CATE estimator obtained from samples of one population, regardless of its quality, may not generalize well to different target populations (Frangakis, 2009). Usually, CATE estimators (often referred to as learners) build upon estimators of the conditional mean outcome given covariates and treatment level (i.e., outcome regression), the probability of treatment given covariates (i.e., propensity score), or both. For instance, plug-in estimators such as those studied in Künzel et al. (2019) — so-called T-learners — are obtained by taking the difference between estimators of the outcome regression obtained separately for each treatment level. T-learners can suffer in performance because they rely on estimation of nuisance parameters that are at least as non-smooth or high-dimensional as the CATE, and are prone to the misspecification of involved outcome regression models; these issues can result in slow convergence or inconsistency of the CATE estimator. Doubly-robust and Neyman-orthogonal CATE estimation strategies like the DR-learner and R-learner (Wager and Athey, 2018; Foster and Syrgkanis, 2019; Nie and Wager, 2021; Kennedy, 2020) mitigate some of these issues by allowing for comparatively fast CATE estimation rates even when nuisance parameters are estimated at slow rates. 
However, while less sensitive to the learning complexity of the nuisance parameters, their predictive accuracy in finite samples still relies on potentially strong smoothness assumptions on the CATE. Even when the CATE is estimated consistently, statistical learning methods often produce biased predictions that overestimate or underestimate the true CATE in the extremes of the predicted values (van Klaveren et al., 2019; Dwivedi et al., 2020). For example, the ‘pooled cohort equations’ (Goff et al., 2014) risk model used to predict cardiovascular disease has been found to underestimate risk in patients with lower socioeconomic status or chronic inflammatory diseases (Lloyd-Jones et al., 2019). The implications of biased treatment effect predictors are profound when used to guide treatment decisions and can range from harmful use to withholding of treatment (van Calster et al., 2019).

Because treatment decisions are consequential, it is essential to guarantee, under minimal assumptions, that treatment effect predictions are representative in magnitude and sign of the actual effects, even when the predictor is a poor approximation of the CATE. In prediction settings, the aim of bestowing these properties on a given predictor is commonly called calibration. A calibrated treatment effect predictor has the property that the average treatment effect among individuals with identical predictions is close to their shared prediction value. Such a predictor is more robust against over- or underestimation of the CATE in the extremes of predicted values. It also has the property that the best predictor of the ITE given the predictor is the predictor itself, which facilitates transparent treatment decision-making. In particular, the optimal treatment rule (Murphy, 2003) given only information provided by the predictor is the one that assigns the treatment predicted to be most beneficial. Consequently, the rule implied by a perfectly calibrated predictor is at least as favorable as the best possible static treatment rule that ignores HTEs. While complementing one another, the aims of calibration and prediction are fundamentally different. For instance, a constant treatment effect predictor can be well-calibrated even though it is a poor predictor of treatment effect heterogeneity (Gupta et al., 2020). In view of this, calibration methods are typically designed to be wrapped around a given black-box prediction pipeline to provide strong calibration guarantees while preserving predictive performance, thereby mitigating several prediction challenges mentioned previously.

In the machine learning literature, calibration has been widely used to enhance prediction models for classification and regression (Bella et al., 2010). However, due to the comparatively little research on calibration of treatment effect predictors, such benefits have not been realized to the same extent in the context of heterogeneous treatment effect prediction. Several works have contributed to addressing this gap in the literature. Brooks et al. (2012) propose a targeted (or debiased) machine learning framework (van der Laan and Rose, 2011) for within-bins calibration that could be applied to the CATE setting. Zhang et al. (2016) and Josey et al. (2022) consider calibration of marginal treatment effect estimates for new populations but do not consider CATEs. Dwivedi et al. (2020) consider estimating calibration error of CATE predictors for subgroup discovery using randomized experimental data. Chernozhukov et al. (2018) and Leng and Dimmery (2021) propose CATE methods for linear calibration, a weaker form of calibration, in randomized experiments. For causal forests, Athey and Wager (2019) evaluate model calibration using a doubly-robust estimator of the ATE among observations above or below the median predicted CATE. Lei and Candès (2021) propose conformal inference methods for constructing calibrated prediction intervals for the ITE from a given predictor but do not consider calibration of the predictor itself. Xu and Yadlowsky (2022) propose a nonparametric doubly-robust estimator of the calibration error of a given treatment effect predictor, which could be used to detect uncalibrated predictors. Our work builds upon the above works by providing a nonparametric doubly-robust method for calibrating treatment effect predictors in general settings.

This paper is organized as follows. In Section 2, we introduce our notation and formally define calibration. There we also provide an overview of traditional calibration methods. In Section 3, we outline our proposed approach, and we describe its theoretical properties in Section 4. In Section 5, we examine the performance of our method in simulations.

2 Statistical Setup

2.1 Notation and Definitions

Suppose we observe $n$ independent and identically distributed realizations of data unit $O:=(W,A,Y)$ drawn from a distribution $P$ , where $W\in\mathcal{W}\subset\mathbb{R}^{d}$ is a vector of baseline covariates, $A\in\{0,1\}$ is a binary indicator of treatment, and $Y\in\mathcal{Y}\subset\mathbb{R}$ is an outcome. For instance, $W$ can include a patient’s demographic characteristics and medical history, $A$ can indicate whether an individual is treated (1) or not (0), and $Y$ could be a binary indicator of a successful clinical outcome. We denote by $\mathcal{D}_{n}:=\{O_{1},O_{2},\ldots,O_{n}\}$ the observed dataset, with $O_{i}:=(W_{i},A_{i},Y_{i})$ representing the observation on the $i^{th}$ study unit.

For covariate value $w\in\mathcal{W}$ and treatment level $a\in\{0,1\}$, we denote by $\pi_{0}(w):=P(A=1\,|\,W=w)$ the propensity score and by $\mu_{0}(a,w):=E(Y\,|\,A=a,W=w)$ the outcome regression. The individual treatment effect is $Y_{1}-Y_{0}$, where $Y_{a}$ represents the potential outcome obtained by setting $A=a$. As a convention, we take higher values of $Y_{1}-Y_{0}$ to be desirable. We assume that the contrast $\tau_{0}(w):=\mu_{0}(1,w)-\mu_{0}(0,w)$ equals the true CATE, $E(Y_{1}-Y_{0}\,|\,W=w)$, which holds under causal assumptions (Rubin, 1974). Throughout, we denote by $\|\cdot\|$ the $L^{2}(P)$ norm, that is, $\|f\|^{2}=\int[f(w)]^{2}dP_{W}(w)$ for any given $P_{W}$-square integrable function $f:\mathcal{W}\rightarrow\mathbb{R}$, where $P_{W}$ is the marginal distribution of $W$ implied by $P$. We deliberately take as convention that the median $\text{median}\{x_{1},x_{2},\dots,x_{k}\}$ of a set $\{x_{1},x_{2},\dots,x_{k}\}$ equals the $\lfloor k/2\rfloor^{th}$ order statistic of this set, where $\lfloor k/2\rfloor:=\max\{z\in\mathbb{N}:z\leq k/2\}$.
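The median convention above differs from the usual one when $k$ is odd, so a small illustration may help. The following is a minimal sketch; the function name `floor_median` is hypothetical, and rejecting $k=1$ (where $\lfloor k/2\rfloor=0$ names no order statistic) is an assumption of this sketch rather than something the text specifies.

```python
def floor_median(values):
    """Return the floor(k/2)-th order statistic (1-indexed) of `values`."""
    ordered = sorted(values)
    rank = len(ordered) // 2  # floor(k/2), counted from 1
    if rank < 1:
        # Assumption of this sketch: the convention needs at least two values.
        raise ValueError("need at least two values for this convention")
    return ordered[rank - 1]


# k = 5: sorted values are 1, 2, 3, 4, 5 and floor(5/2) = 2, so the result
# is the 2nd smallest value (the usual median would be 3).
odd_case = floor_median([5, 1, 4, 2, 3])

# k = 4: sorted values are 1, 2, 3, 4 and floor(4/2) = 2.
even_case = floor_median([4, 1, 3, 2])
```

Under this convention the median is always one of the supplied values.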

Let $\tau:\mathcal{W}\rightarrow\mathbb{R}$ be a treatment effect predictor, that is, a function that maps a realization $w$ of $W$ to a treatment effect prediction $\tau(w)$ . In practice, $\tau$ can be obtained using any black-box algorithm. Below, we first consider $\tau$ to be fixed, though we later address situations in which $\tau$ is learned from the data used for subsequent calibration. We define the calibration function $\gamma_{0}(\tau,w):=E[Y_{1}-Y_{0}|\tau(W)=\tau(w)]$ as the conditional mean of the individual treatment effect given treatment effect score value $\tau(w)$ . By the tower property, $\gamma_{0}(\tau,w)=E[\tau_{0}(W)\,|\,\tau(W)=\tau(w)]$ , and so, expectations only involving $\gamma_{0}(\tau,W)$ and other functions of $W$ can be taken with respect to $P_{W}$ .
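The tower-property step can be written out explicitly. The short derivation below assumes only what is stated above: $\tau(W)$ is a deterministic function of $W$, and $\tau_{0}(W)=E(Y_{1}-Y_{0}\,|\,W)$.

```latex
\begin{align*}
\gamma_{0}(\tau,w)
  &= E[\,Y_{1}-Y_{0} \mid \tau(W)=\tau(w)\,] \\
  &= E\{\,E[\,Y_{1}-Y_{0} \mid W\,] \mid \tau(W)=\tau(w)\,\} \\
  &= E[\,\tau_{0}(W) \mid \tau(W)=\tau(w)\,].
\end{align*}
```

The second equality holds because conditioning on $W$ is finer than conditioning on $\tau(W)$.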

The solution to an isotonic regression problem is typically nonunique. Throughout this text, we follow Groeneboom and Lopuhaa (1993) in taking the unique càdlàg piecewise constant solution of the isotonic regression problem that can only take jumps at observed values of the predictor.

2.2 Measuring Calibration and the Calibration-Distortion Decomposition

Various definitions of risk predictor calibration have been proposed in the literature — see Gupta and Ramdas (2021) and Gupta et al. (2020) for a review. Here, we outline our definition of calibration and its rationale. Given a treatment effect predictor $\tau$ , the best predictor of the individual treatment effect in terms of MSE is $w\mapsto\gamma_{0}(\tau,w):=E[Y_{1}-Y_{0}\,|\,\tau(W)=\tau(w)]$ . By the law of total expectation, this predictor has the property that, for any interval $[a,b)$ ,

$$E\left\{[\tau_{0}(W)-\gamma_{0}(\tau,W)]\,I(\gamma_{0}(\tau,W)\in[a,b))\right\}=0. \tag{1}$$

Equation 1 indicates that $\gamma_{0}(\tau,\cdot)$ is perfectly calibrated on $[a,b)$. Therefore, when a given predictor $\tau$ is such that $\tau(W)=\gamma_{0}(\tau,W)$ with $P$-probability one, $\tau$ is said to be perfectly calibrated (Gupta et al., 2020) for the CATE; for brevity, we omit “for the CATE” hereafter when the type of calibration being referred to is clear from context.

In general, perfect calibration cannot realistically be achieved in finite samples. A more modest goal is for the predictor $\tau$ to be approximately calibrated in that $\tau(w)$ is close to $\gamma_{0}(\tau,w)$ across all covariate values $w\in\mathcal{W}$ . This naturally suggests the calibration measure:

$$\text{CAL}(\tau):=\int[\gamma_{0}(\tau,w)-\tau(w)]^{2}\,dP_{W}(w). \tag{2}$$

This measure, referred to as the $\ell_{2}$-expected calibration error, arises both in prediction (Gupta et al., 2020) and in the assessment of treatment effect heterogeneity (Xu and Yadlowsky, 2022). We note that $\text{CAL}(\tau)$ is zero if $\tau$ is perfectly calibrated. Additionally, averaging in $\text{CAL}(\tau)$ with respect to measures other than $P_{W}$ could be more relevant in certain applications; such cases can occur, for instance, when there is a change of population that results in covariate shift and we are interested in measuring how well $\tau$ is calibrated in the new population.
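To make the calibration measure concrete, the sketch below evaluates $\text{CAL}(\tau)$ exactly on a finite covariate space, where $\gamma_{0}(\tau,w)$ reduces to a $P_{W}$-weighted average of $\tau_{0}$ over covariate values sharing the same prediction. The toy population and the function names are hypothetical, and the true CATE is treated as known, which is only possible in a simulation.

```python
import numpy as np


def calibration_function(tau, tau0, p_w):
    """gamma_0(tau, w) on a finite covariate space: the P_W-weighted mean of
    tau0 over all covariate values that share the prediction tau(w)."""
    tau, tau0, p_w = (np.asarray(x, dtype=float) for x in (tau, tau0, p_w))
    gamma = np.empty_like(tau)
    for value in np.unique(tau):
        same_pred = tau == value
        gamma[same_pred] = np.average(tau0[same_pred], weights=p_w[same_pred])
    return gamma


def cal(tau, tau0, p_w):
    """Calibration error: integral of [gamma_0(tau, w) - tau(w)]^2 dP_W(w)."""
    gamma = calibration_function(tau, tau0, p_w)
    tau = np.asarray(tau, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    return float(np.sum(p_w * (gamma - tau) ** 2))


# Hypothetical toy population: four equally likely covariate values.
tau0 = np.array([0.0, 1.0, 2.0, 3.0])  # true CATE at each covariate value
p_w = np.full(4, 0.25)                 # marginal distribution of W

cal_truth = cal(tau0, tau0, p_w)                              # tau = tau0
cal_constant = cal(np.full(4, 1.5), tau0, p_w)                # constant at the ATE
cal_shrunk = cal(np.array([1.0, 1.0, 2.0, 2.0]), tau0, p_w)   # too little spread
```

In this toy case the constant predictor has zero calibration error despite carrying no information about heterogeneity, while the third predictor, which understates the spread of effects, has positive calibration error.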

Interestingly, the above calibration measure plays a role in a decomposition of the mean squared error (MSE) between the treatment predictor and the true CATE, in that

$$\text{MSE}(\tau):=\|\tau_{0}-\tau\|^{2}=\text{CAL}(\tau)+\text{DIS}(\tau), \tag{3}$$

with $\text{DIS}(\tau):=E\{\text{var}[\tau_{0}(W)\,|\,\tau(W)]\}$ a quantity we term the distortion of $\tau$. We refer to the above as a calibration-distortion decomposition of the MSE. A consequence of the calibration-distortion decomposition is that MSE-consistent CATE estimators are also calibrated asymptotically. However, particularly in settings where the covariates are high-dimensional or the CATE is nonsmooth, the calibration error rate for such predictors can be arbitrarily slow; this is discussed further after Theorem 1.
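The decomposition can be checked numerically on a finite covariate space. The sketch below draws a hypothetical population at random, builds a coarse predictor by rounding a noisy copy of the true CATE, and compares the MSE with the sum of the calibration and distortion terms; all names and simulation settings are illustrative assumptions.

```python
import numpy as np


def decompose_mse(tau, tau0, p_w):
    """Return (MSE, CAL, DIS) for a predictor on a finite covariate space."""
    tau, tau0, p_w = (np.asarray(x, dtype=float) for x in (tau, tau0, p_w))
    # Group covariate values by their prediction tau(w).
    _, group = np.unique(tau, return_inverse=True)
    mass = np.bincount(group, weights=p_w)
    # gamma_0(tau, w): P_W-weighted mean of tau0 within each prediction group.
    gamma = (np.bincount(group, weights=p_w * tau0) / mass)[group]
    cal = float(np.sum(p_w * (gamma - tau) ** 2))
    dis = float(np.sum(p_w * (tau0 - gamma) ** 2))  # E{var[tau0(W) | tau(W)]}
    mse = float(np.sum(p_w * (tau0 - tau) ** 2))
    return mse, cal, dis


rng = np.random.default_rng(0)
n_points = 12
tau0 = rng.normal(size=n_points)            # hypothetical true CATE values
p_w = rng.dirichlet(np.ones(n_points))      # hypothetical distribution of W
tau = np.round(tau0 + rng.normal(scale=0.5, size=n_points))  # coarse predictor

mse, cal, dis = decompose_mse(tau, tau0, p_w)
gap = abs(mse - (cal + dis))  # zero up to floating-point error

# A constant predictor equal to the ATE: calibrated, but all of the
# heterogeneity in tau0 shows up as distortion.
ate = float(np.sum(p_w * tau0))
mse_c, cal_c, dis_c = decompose_mse(np.full(n_points, ate), tau0, p_w)
```

The cross term vanishes because $\tau_{0}-\gamma_{0}$ averages to zero within each group of covariate values sharing a prediction, which is why the identity holds exactly rather than approximately.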

To interpret $\text{DIS}(\tau)$ , we find it helpful to envision a scenario in which a distorted message is passed between two persons. The goal is for Person 2 to discern the value of $\tau_{0}(w)$ , where the value of $w\in\mathcal{W}$ is only known to Person 1. Person 1 transmits $w$ , which is then distorted through a function $\tau$ and received by Person 2. Person 2 knows the functions $\tau$ and $\tau_{0}$ , and may use this information to try to discern $\tau_{0}(w)$ . If $\tau$ is one-to-one, $\tau_{0}(w)$ can be discerned by simply applying $\tau_{0}\circ\tau^{-1}$ to the received message $\tau(w)$ . More generally, whenever there exists a function $f$ such that $\tau_{0}=f\circ\tau$ , Person 2 can recover the value of $\tau_{0}(w)$ . For example, if $\tau=\tau_{0}$ then $f$ is the identity function. If no such function $f$ exists, it may not be possible for Person 2 to recover the value of $\tau_{0}(w)$ . Instead, they may predict $\tau_{0}(w)$ based on $\tau(w)$ via $\gamma_{0}(\tau,w)$ . Averaged over $W\sim P_{W}$ , the MSE of this approach is precisely $\text{DIS}(\tau)$ . See Equation 3 in Kuleshov and Liang (2015) for a related decomposition of $E\,[\{Y-\tau(X)\}^{2}]=\text{MSE}(\tau)+E\,[\{Y-\tau_{0}(X)\}^{2}]$ derived in the context of probability forecasting.

The calibration-distortion decomposition shows that, at a given level of distortion, better-calibrated treatment effect predictors have lower MSE for the true CATE function. We will explore this fact later in this work when showing that, in addition to improving calibration, our proposed calibration procedure can improve the MSE of CATE predictors.

2.3 Calibrating Predictors: desiderata and classical methods

In most calibration methods, the key goal is to find a function $\theta:\mathbb{R}\rightarrow\mathbb{R}$ of a given predictor $\tau$ such that $\text{CAL}(\theta\circ\tau)<\text{CAL}(\tau)$ , where $\theta\circ\tau$ refers to the composed predictor $w\mapsto\theta(\tau(w))$ . A mapping $\theta$ that pursues this objective is referred to as a calibrator. Ideally, a calibrator $\theta_{n}$ for $\tau$ constructed from the dataset $\mathcal{D}_{n}$ should satisfy the following desiderata:

Property 1: $\text{CAL}(\theta_{n}\circ\tau)$ tends to zero quickly as $n$ grows;

Property 2: $\theta_{n}\circ\tau$ and $\tau$ are comparably predictive of $\tau_{0}$.

Property 1 states the primary objective of a calibrator, that is, to yield a well-calibrated predictor. Property 2 requires that the calibrator not destroy the predictive power of the initial predictor in the pursuit of Property 1, which would occur if the calibration term in decomposition (3) were made small at the cost of dramatic inflation of the distortion term.

In the traditional setting of classification and regression, a natural aim is to learn, for $a\in\{0,1\}$ , a predictor $w\mapsto\nu^{(a)}(w)$ of the outcome $Y$ among individuals with treatment $A=a$ . The best possible such predictor is given by the treatment-specific outcome regression $w\mapsto\mu_{0}(a,w)$ . For $a\in\{0,1\}$ , $\nu^{(a)}$ is said to be calibrated for the outcome regression if $\nu^{(a)}(w)\approx E(Y\mid\nu^{(a)}(W)=\nu^{(a)}(w),A=a)$ for $P_{0}$ -almost every $w$ . Such a calibrated predictor can be obtained using existing calibration methods for regression (Huang et al., 2020), which we review in the next paragraph. It is natural to wonder, then, whether existing calibration approaches can be directly used to calibrate for the CATE. As a concrete example, given predictors $\nu^{(1)}$ and $\nu^{(0)}$ of $\mu_{0}(1,\cdot)$ and $\mu_{0}(0,\cdot)$ , a natural CATE predictor is the T-learner $\tau:=\nu^{(1)}-\nu^{(0)}$ . However, even if $\nu^{(1)}$ and $\nu^{(0)}$ are calibrated for their respective outcome regressions, the predictor $\tau$ can still be poorly calibrated for the CATE. Indeed, in settings with treatment-outcome confounding, T-learners can be poorly calibrated when the calibrated predictors $\nu^{(1)}$ and $\nu^{(0)}$ are poor approximations of their respective outcome regressions. As an extreme example, suppose that $\nu^{(a)}$ equals the constant predictor $w\mapsto E(Y\mid A=a)$ for $a\in\{0,1\}$ , which is perfectly calibrated for the outcome regression. Then, the corresponding T-learner $\tau(\cdot)=E(Y\mid A=1)-E(Y\mid A=0)$ typically has poor calibration for the CATE in observational settings.
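The extreme example above can be worked through on a two-point covariate space. The sketch below uses hypothetical numbers in which treatment is more likely when $W=1$ and the outcome increases with $W$, so treatment and outcome are confounded; the constant arm-specific predictors are calibrated for their outcome regressions by construction, yet their difference misses the average treatment effect.

```python
import numpy as np

# Hypothetical two-point population with confounding by W.
p_w = np.array([0.5, 0.5])          # P(W = 0), P(W = 1)
propensity = np.array([0.2, 0.8])   # pi_0(w) = P(A = 1 | W = w)
mu_treated = np.array([1.0, 2.0])   # mu_0(1, w) = w + 1
mu_control = np.array([0.0, 1.0])   # mu_0(0, w) = w


def marginal_arm_mean(mu_arm, arm_prob):
    """E(Y | A = a), averaging mu_0(a, W) over the law of W given A = a."""
    weights = p_w * arm_prob
    return float(np.sum(weights * mu_arm) / np.sum(weights))


nu1 = marginal_arm_mean(mu_treated, propensity)        # constant predictor, A = 1
nu0 = marginal_arm_mean(mu_control, 1.0 - propensity)  # constant predictor, A = 0
t_learner = nu1 - nu0                                  # constant T-learner

true_cate = mu_treated - mu_control
ate = float(np.sum(p_w * true_cate))

# For a constant predictor, gamma_0 equals the ATE, so CAL is a squared gap.
cal_t_learner = (ate - t_learner) ** 2
```

Here the true CATE equals 1 for every covariate value, while the constant T-learner predicts roughly 1.6, giving a calibration error of about 0.36 that does not shrink with sample size.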

In classification and regression settings (Huang et al., 2020), the most commonly used calibration methods include Platt’s scaling (Platt et al., 1999), histogram binning (Zadrozny and Elkan, 2001), Bayesian binning into quantiles (Naeini et al., 2015), and isotonic calibration (Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005). Broadly, Platt’s scaling is designed for binary outcomes and uses the estimated values of the predictor to fit the logistic regression model

[TABLE]

with $\alpha,\beta\in\mathbb{R}$ . While it typically satisfies Property 2, Platt’s scaling is based on strong parametric assumptions and, as a consequence, may lead to predictions with significant calibration error, even asymptotically (Gupta et al., 2020). Nevertheless, Platt’s scaling may be preferred when limited data is available. Histogram or quantile binning involves partitioning the sorted values of the predictor into a fixed number of bins. Given an initial prediction, the calibrated prediction is the empirical mean of the observed outcome values within the corresponding prediction bin. A significant limitation of histogram binning is that it requires a priori specification of the number of bins. Selecting too few bins can significantly degrade the predictive power of the calibrated predictor, whereas selecting too many bins can lead to poor calibration. Bayesian binning improves upon histogram binning by considering multiple binning models and their combinations; nevertheless, it still requires pre-specification of binning models and prior distributions.
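To make the binning recipe above concrete, the following is a minimal sketch of quantile binning with a user-chosen number of bins. The function names `fit_histogram_binning` and `predict_histogram_binning` are hypothetical, equal-count bins are one possible convention, and the code is an illustration rather than the implementation used in any of the cited works.

```python
from bisect import bisect_right


def fit_histogram_binning(preds, outcomes, n_bins):
    """Split the sorted predictions into n_bins equal-count bins and store
    the mean outcome of each bin (illustrative convention, not canonical)."""
    n = len(preds)
    if not 1 <= n_bins <= n:
        raise ValueError("n_bins must be between 1 and the sample size")
    order = sorted(range(n), key=lambda i: preds[i])
    edges, means = [], []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(sum(outcomes[i] for i in idx) / len(idx))
        if b > 0:
            # Left edge of bin b: smallest prediction assigned to it.
            edges.append(preds[idx[0]])
    return edges, means


def predict_histogram_binning(edges, means, x):
    """Calibrated prediction: mean outcome of the bin containing x."""
    return means[bisect_right(edges, x)]
```

The number of bins is the only tuning parameter here, which is exactly the quantity that must be fixed a priori: a small `n_bins` coarsens the predictor, while a large one leaves few observations per bin.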

Isotonic calibration is a histogram binning method that learns the bins from data using isotonic regression, a nonparametric method traditionally used for estimating monotone functions (Barlow and Brunk, 1972; Martino et al., 2019; Huang et al., 2020). Specifically, the bins are selected by minimizing an empirical MSE criterion under the constraint that the calibrated predictor is a nondecreasing monotone transformation of the original predictor. Isotonic calibration is motivated by the heuristic that, for a good predictor $\tau$ , the calibration function $\gamma_{0}(\tau,\cdot)$ should be approximately monotone as a function of $\tau$ . For instance, when $\tau=\tau_{0}$ , the map $\tau_{0}\mapsto\gamma_{0}(\tau_{0},\cdot)=\tau_{0}$ is the identity function. Despite its popularity and strong performance in practice (Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005; Gupta and Ramdas, 2021), to date, whether isotonic calibration satisfies distribution-free calibration guarantees remains an open question (Gupta, 2022). In this work, we will show that isotonic calibration satisfies a distribution-free calibration guarantee in the sense of Property 1. We further establish that Property 2 holds, in that the isotonic selection criterion ensures that the calibrated predictor is at least as predictive as the original predictor up to negligible error.
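The way isotonic regression learns bins from data can be illustrated with the pool-adjacent-violators algorithm. The sketch below is a plain-Python version written for exposition; the names `isotonic_fit` and `isotonic_predict` are hypothetical, and production code would normally rely on existing isotonic regression software.

```python
from bisect import bisect_right


def isotonic_fit(x, y):
    """Least-squares nondecreasing fit of y on x via pool-adjacent-violators.
    Returns (knots, values): the fit equals values[j] on [knots[j], knots[j+1])."""
    # Aggregate tied x values so they receive a common fitted value.
    agg = {}
    for xi, yi in zip(x, y):
        s, c = agg.get(xi, (0.0, 0))
        agg[xi] = (s + yi, c + 1)
    blocks = []  # each block: [sum of y, count, leftmost x]
    for xi in sorted(agg):
        blocks.append([agg[xi][0], agg[xi][1], xi])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   >= blocks[-1][0] / blocks[-1][1]):
            s, c, _ = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def isotonic_predict(knots, values, t):
    """Evaluate the fitted step function at t (constant beyond the data range)."""
    return values[max(bisect_right(knots, t) - 1, 0)]
```

Each pooled block plays the role of a data-driven bin, and the fitted value within a block is the average outcome of the observations it contains.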

3 Causal Isotonic Calibration

In real-world experiments, Dwivedi et al. (2020) found empirically that state-of-the-art CATE estimators tend to be poorly calibrated. However, strikingly, the authors found that such CATE predictors can often still correctly rank the average treatment effect among subgroups defined by bins of the predicted effects. These findings support the heuristic that the calibration function $\gamma_{0}(\tau,\cdot)$ is often approximately monotone as a function of the predictor $\tau$ . This heuristic makes extending isotonic calibration to the CATE setting especially appealing since the monotonicity constraint ensures that the calibrated predictions preserve the (non-strict) ranking of the original predictions.

Inspired by isotonic calibration, we propose a doubly-robust calibration method for treatment effects, which we refer to as causal isotonic calibration. Causal isotonic calibration takes a given predictor trained on some dataset and performs calibration using an independent (or hold-out) dataset. Mechanistically, causal isotonic calibration first automatically learns uncalibrated regions of the given predictor. Calibrated predictions are then obtained by consolidating individual predictions within each region into a single value using a doubly-robust estimator of the ATE. In addition, we introduce a novel data-efficient variant of calibration which we refer to as cross-calibration. In contrast with the standard calibration approach, causal isotonic cross-calibration takes cross-fitted predictors and outputs a single calibrated predictor obtained using all available data. Our methods can be implemented using standard isotonic regression software.

Let $\tau$ be a given treatment effect predictor assumed, for now, to have been built using an external dataset, and suppose that $\mathcal{D}_{n}$ is the available calibration dataset. In general, we can calibrate the predictor $\tau$ using regression-based calibration methods by employing an appropriate surrogate outcome for the CATE. For both experimental and observational settings, a surrogate outcome with favorable efficiency and robustness properties is the pseudo-outcome $\chi_{0}(O)$ defined via the mapping

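$$\chi_{0}(o):=\mu_{0}(1,w)-\mu_{0}(0,w)+\frac{a-\pi_{0}(w)}{\pi_{0}(w)\{1-\pi_{0}(w)\}}\{y-\mu_{0}(a,w)\}\qquad(4)$$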

with $o:=(w,a,y)$ representing a realization of the data unit. This pseudo-outcome has been used as surrogate for the CATE in previous methods for estimating $\tau_{0}$ , including the DR-learner (Luedtke and van der Laan, 2016; Kennedy, 2020). If $\chi_{0}$ were known, an external predictor $\tau$ could be calibrated using $\mathcal{D}_{n}$ by isotonic regression of the pseudo-outcomes $\chi_{0}(O_{1}),\chi_{0}(O_{2}),\ldots,\chi_{0}(O_{n})$ onto the calibration sample predictions $\tau(W_{1}),\tau(W_{2}),\ldots,\tau(W_{n})$ . However, $\chi_{0}$ depends on $\pi_{0}$ and $\mu_{0}$ , which are usually unknown and must be estimated.
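As a small worked example, the pseudo-outcome can be evaluated for a single observation once the nuisance values are available. The sketch assumes the pseudo-outcome takes the standard doubly-robust (augmented inverse probability weighted) form used by the DR-learner; the function name `dr_pseudo_outcome` and its argument names are hypothetical.

```python
def dr_pseudo_outcome(a, y, pi, mu1, mu0):
    """Doubly-robust pseudo-outcome for one observation (assumed AIPW form).

    a   : observed treatment (0 or 1)
    y   : observed outcome
    pi  : propensity score P(A = 1 | W = w), strictly between 0 and 1
    mu1 : outcome regression E(Y | A = 1, W = w)
    mu0 : outcome regression E(Y | A = 0, W = w)
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    mu_a = mu1 if a == 1 else mu0
    # Plug-in effect plus an inverse-probability-weighted residual correction.
    return (mu1 - mu0) + (a - pi) / (pi * (1.0 - pi)) * (y - mu_a)
```

With `pi = 0.5`, `mu1 = 0.6` and `mu0 = 0.2`, a treated unit with `y = 1` receives the plug-in effect 0.4 plus the weighted residual 2 × 0.4, while a control unit with `y = 0` receives 0.4 plus −2 × (−0.2).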

A natural approach for calibrating treatment effect predictors using isotonic regression is as follows. First, define $\chi_{n}$ as the estimated pseudo-outcome function based on estimates $\mu_{n}$ and $\pi_{n}$ derived from $\mathcal{D}_{n}$ . Then, a calibrated predictor is given by $\theta_{n}\circ\tau$ , where the calibrator $\theta_{n}$ is found via isotonic regression as a minimizer over $\mathcal{F}_{iso}:=\{\theta:\mathbb{R}\rightarrow\mathbb{R};\;\theta\text{ is monotone nondecreasing}\}$ of the empirical least-squares risk function

$$\theta\mapsto\sum_{i=1}^{n}\left[\chi_{n}(O_{i})-\theta\circ\tau(W_{i})\right]^{2}.$$

However, this optimization problem requires a double use of $\mathcal{D}_{n}$ : once, for creating the pseudo-outcomes $\chi_{n}(O_{i})$ , and a second time, in the calibration step. This double usage could lead to over-fitting (Kennedy, 2020), and so we recommend obtaining pseudo-outcomes via sample splitting or cross-fitting. Sample splitting involves randomly partitioning $\mathcal{D}_{n}$ into $\mathcal{E}_{m}\cup\mathcal{C}_{\ell}$ , with $\mathcal{E}_{m}$ used to estimate $\mu_{0}$ and $\pi_{0}$ , and $\mathcal{C}_{\ell}$ used to carry out the calibration step — see Algorithm 1 for details. Cross-fitting improves upon sample splitting by using all available data to estimate $\mu_{0}$ and $\pi_{0}$ as well as to carry out the calibration step. Algorithm 4, outlined in Appendix B, is the cross-fitted variant of Algorithm 1.
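The sample-splitting recipe described above can be sketched end to end as follows. This is an illustration under strong simplifying assumptions, not a transcription of Algorithm 1: the propensity score is taken to be a known constant (as in a randomized experiment), the outcome regression is estimated by arm-specific sample means on the training split, and the names `pava` and `causal_isotonic_calibrate` are hypothetical.

```python
import random
from bisect import bisect_right


def pava(x, y):
    """Least-squares nondecreasing fit of y on x; returns (knots, values)."""
    agg = {}
    for xi, yi in zip(x, y):
        s, c = agg.get(xi, (0.0, 0))
        agg[xi] = (s + yi, c + 1)
    blocks = []  # each block: [sum of y, count, leftmost x]
    for xi in sorted(agg):
        blocks.append([agg[xi][0], agg[xi][1], xi])
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   >= blocks[-1][0] / blocks[-1][1]):
            s, c, _ = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def causal_isotonic_calibrate(tau, data, pi_known=0.5, train_frac=0.5, seed=0):
    """data: list of (w, a, y) tuples; tau: initial predictor w -> float.
    Returns the calibrated predictor w -> float."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    m = int(len(shuffled) * train_frac)
    train, calib = shuffled[:m], shuffled[m:]

    # Nuisance step on the training split (assumes both arms are present).
    def arm_mean(arm):
        ys = [y for _, a, y in train if a == arm]
        return sum(ys) / len(ys)

    mu1, mu0 = arm_mean(1), arm_mean(0)

    # Calibration step: pseudo-outcomes on the held-out split only.
    preds, pseudo = [], []
    for w, a, y in calib:
        mu_a = mu1 if a == 1 else mu0
        weight = (a - pi_known) / (pi_known * (1.0 - pi_known))
        preds.append(tau(w))
        pseudo.append((mu1 - mu0) + weight * (y - mu_a))

    # Isotonic regression of pseudo-outcomes onto the initial predictions.
    knots, values = pava(preds, pseudo)
    return lambda w: values[max(bisect_right(knots, tau(w)) - 1, 0)]
```

The crude outcome regression is tolerable in this sketch only because the propensity score is assumed known; in observational data both nuisance functions would need to be estimated flexibly on the training split.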

In practice, the external dataset used to construct $\tau$ for input into Algorithm 1 is likely to arise from a sample splitting approach wherein a large dataset is split in two, with one half used to estimate $\tau$ and the other to calibrate it. This naturally leads to the question of whether there is an approach that fully utilizes the entire dataset for both fitting an initial estimate of $\tau_{0}$ and calibration. Algorithm 2 describes causal isotonic cross-calibration, which provides a means to accomplish precisely this. In brief, this approach applies Algorithm 1 a total of $k$ times on different splits of the data, where for each split an initial predictor of $\tau_{0}$ is fitted based on the first subset of the data and this predictor is calibrated using the second subset. These $k$ calibrated predictors are then aggregated via a pointwise median. Interestingly, other aggregation strategies, such as pointwise averaging, can lead to uncalibrated predictions (Gneiting and Ranjan, 2013; Rahaman and Thiery, 2020). A computationally simpler variant of Algorithm 2 is given by Algorithm 3. In this implementation, a single isotonic regression is performed using the pooled out-of-fold predictions; this variant may also yield more stable performance in finite samples than Algorithm 2 (see Section 2.1.2 of Xu and Yadlowsky (2022) for a related discussion in the context of debiased machine learning).
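The aggregation step of cross-calibration can be sketched as below. The sketch uses the lower median, so that the aggregated prediction at each covariate value coincides with the prediction of one of the $k$ calibrated predictors; whether this matches the paper's convention for even $k$ is an assumption, and the name `median_aggregate` is hypothetical.

```python
from statistics import median_low


def median_aggregate(predictors):
    """Combine calibrated predictors w -> float by a pointwise (lower) median."""
    predictors = list(predictors)
    if not predictors:
        raise ValueError("at least one predictor is required")

    def aggregated(w):
        # median_low always returns one of the supplied values.
        return median_low([p(w) for p in predictors])

    return aggregated
```

For example, given three fold-specific calibrated predictors, `median_aggregate([p1, p2, p3])(w)` returns the middle of the three predictions at `w`.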

4 Large-Sample Theoretical Properties

We now present theory for causal isotonic calibration. We obtain results for causal isotonic calibration described by Algorithm 1 applied to a fixed predictor $\tau$ . We also establish MSE guarantees for the calibrated predictor and argue that the proposed calibrator satisfies Properties 1 and 2. We extend our results to the procedure of Algorithm 2.

For ease of presentation, we only establish theoretical results for the case where the nuisance estimators are obtained using sample splitting. With minor modifications, our results can be readily extended to cross-fitting by arguing along the lines of Newey and Robins (2018). In that spirit, we assume that the available data $\mathcal{D}_{n}$ is the union of a training dataset $\mathcal{E}_{m}$ and a calibration dataset $\mathcal{C}_{\ell}$ of sizes $m$ and $\ell$ , respectively, with $n=m+\ell$ and $\min\{m,\ell\}\rightarrow\infty$ as $n\rightarrow\infty$ . Let $\tau_{n}^{*}$ be the calibrated predictor obtained from Algorithm 1 using $\tau$ , $\mathcal{E}_{m}$ and $\mathcal{C}_{\ell}$ where the estimated pseudo-outcome $\chi_{m}$ is obtained by substituting estimates $\pi_{m}$ and $\mu_{m}$ of $\pi_{0}$ and $\mu_{0}$ into (4).

Condition 1 (bounded outcome support).

The $P$ -support $\mathcal{Y}$ of $Y$ is a uniformly bounded subset of $\mathbb{R}$ .

Condition 2 (positivity).

There exists $\epsilon>0$ such that $P(\epsilon<\pi_{0}(W)<1-\epsilon)=1$ .

Condition 3 (independence).

Estimators $\pi_{m}$ and $\mu_{m}$ do not use any data in $\mathcal{C}_{\ell}$ .

Condition 4 (bounded range of $\pi_{m}$ , $\mu_{m}$ , $\tau$ ).

There exist $0<\eta,\alpha<\infty$ such that $P(\eta<\pi_{m}(W)<1-\eta)=P(|\mu_{m}(A,W)|<\alpha)=P(|\tau(W)|<\alpha)=1$ for $m=1,2,\ldots$

Condition 5 (bounded variation of best predictor).

The function $\theta_{0}:\mathbb{R}\mapsto\mathbb{R}$ such that $\theta_{0}\circ\tau=\gamma_{0}(\tau,\cdot)$ is of bounded total variation.

It is worth noting that the initial predictor and its best monotone transformation can be arbitrarily poor CATE predictors. Condition 1 holds trivially when outcomes are binary, but even continuous outcomes are often known to satisfy fixed bounds (e.g., physiologic bound, limit of detection of instrument) in applications. Condition 2 is standard in causal inference and requires that all individuals have a positive probability of being assigned to either treatment or control. Condition 3 follows as a direct consequence of the sample splitting approach, because the estimators are obtained from a sample that is independent of the data used to carry out the calibration step. Condition 4 requires that the estimators of the outcome regression and propensity score be bounded; this can be enforced, for example, by thresholding when estimating these regression functions. Condition 5 excludes cases in which the best possible predictor of the CATE given only the initial predictor $\tau$ has pathological behavior, in the sense that it has infinite variation norm as a (univariate) mapping of $\tau$ . We stress here that isotonic regression is used only as a tool for calibration, and our theoretical guarantees do not require any monotonicity on components of the data-generating mechanism; for example, $\gamma_{0}(\tau,w)$ need not be monotone as a function of $\tau(w)$ .
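The thresholding mentioned in connection with Condition 4 can be as simple as clipping the nuisance estimates to fixed bounds. The sketch below is one way to do so; the function names and the default bounds `eta=0.01` and `alpha=10.0` are arbitrary illustrative choices, not values taken from the paper.

```python
def clip(value, lower, upper):
    """Restrict value to the closed interval [lower, upper]."""
    return min(max(value, lower), upper)


def truncate_nuisances(pi_hat, mu_hat, eta=0.01, alpha=10.0):
    """Clip a propensity estimate to [eta, 1 - eta] and an outcome
    regression estimate to [-alpha, alpha] (illustrative bounds)."""
    if not 0.0 < eta < 0.5:
        raise ValueError("eta must lie in (0, 0.5)")
    return clip(pi_hat, eta, 1.0 - eta), clip(mu_hat, -alpha, alpha)
```

Clipping the propensity estimate away from 0 and 1 also keeps the inverse weights in the pseudo-outcome bounded.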

The following theorem establishes the calibration rate of the predictor $\tau_{n}^{*}$ obtained using causal isotonic calibration.

Theorem 1 ( $\tau_{n}^{*}$ is well-calibrated).

Under Conditions 1–5, as $n\rightarrow\infty$ , it holds that

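$$\text{CAL}(\tau_{n}^{*})=O_{P}\left(\ell^{-2/3}+\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert^{2}\right).$$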

The calibration rate can be expressed as the sum of an oracle calibration rate and the rate of a second-order cross-product bias term involving nuisance estimators. Notably, the causal isotonic calibrator can satisfy Property 1 at the oracle rate $\ell^{-2/3}$ so long as $\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert$ shrinks no slower than $\ell^{-1/3}$ , which requires that at least one of $\pi_{0}$ and $\mu_{0}$ be estimated well in an appropriate sense. If $\pi_{0}$ is known, as in most randomized experiments, the fast calibration rate of $\ell^{-2/3}$ can be achieved even when $\mu_{m}$ is inconsistent, thereby providing distribution-free calibration guarantees irrespective of the smoothness of the outcome regression or dimension of the covariate vector. When $\pi_{0}$ is unknown, the oracle rate of $\ell^{-2/3}$ may not be achievable if the propensity score and outcome regression are insufficiently smooth relative to the dimension of the covariate vector (Kennedy, 2020; Kennedy et al., 2022).

It is interesting to contrast the calibration guarantee in Theorem 1 with existing MSE guarantees for DR-learners (Kennedy, 2020) since, in view of (3), they also provide calibration guarantees. While the MSE estimation rates for the CATE depend on the dimension and smoothness of $\tau_{0}$ , the curse of dimensionality for our calibration rates only manifests itself in the doubly-robust cross-remainder term that involves nuisance estimation rates. For instance, when $\ell=m=n/2$ , if $\pi_{0}$ and $\mu_{0}$ are known to be Hölder smooth with exponent $\alpha\geq 1$ , the calibration rate implied by Theorem 1 with minimax optimal nuisance estimators is, up to logarithmic factors, $\ell^{-2/3}+\ell^{-4\alpha/(2\alpha+d)}$ . In contrast, if $\tau_{0}$ is known to be Hölder smooth with exponent $\beta\geq 1$ , a minimax optimal estimator of $\tau_{0}$ is only guaranteed to achieve an MSE, and therefore calibration, rate of $\ell^{-2\beta/(2\beta+d)}+\ell^{-4\alpha/(2\alpha+d)}$ (Kennedy et al., 2022). When the nuisance smoothness satisfies $\alpha\geq d/4$ , causal isotonic calibration can achieve the oracle calibration rate of $\ell^{-2/3}$ , whereas a minimax optimal CATE estimator is only guaranteed to achieve the same calibration rate under the stringent condition that the smoothness of $\tau_{0}$ satisfies $\beta\geq d$ .
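The rate comparison above reduces to simple arithmetic on exponents, since a sum of two polynomial rates is governed by the smaller exponent. The following sketch computes the exponent $r$ in a rate of the form $\ell^{-r}$ (ignoring logarithmic factors) for both expressions; the function names are hypothetical, and the inputs are assumed to be integers or `Fraction` objects so that the arithmetic is exact.

```python
from fractions import Fraction


def calibration_rate_exponent(alpha, d):
    """Exponent r in l^{-r} for l^{-2/3} + l^{-4*alpha/(2*alpha + d)}."""
    return min(Fraction(2, 3), Fraction(4 * alpha, 2 * alpha + d))


def cate_mse_rate_exponent(alpha, beta, d):
    """Exponent r in l^{-r} for
    l^{-2*beta/(2*beta + d)} + l^{-4*alpha/(2*alpha + d)}."""
    return min(Fraction(2 * beta, 2 * beta + d),
               Fraction(4 * alpha, 2 * alpha + d))
```

For instance, with $d=8$ and $\alpha=2=d/4$, the calibration exponent is $2/3$, whereas the MSE exponent with $\beta=2$ is only $1/3$ and reaches $2/3$ once $\beta=8=d$.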

The following theorem states that the predictor obtained by taking pointwise medians of calibrated predictors is also calibrated.

Theorem 2 (Pointwise median preserves calibration).

Let $\tau_{n,1}^{*},\tau_{n,2}^{*},\ldots,\tau_{n,k}^{*}$ be predictors, and define pointwise $\tau_{n}^{*}(w):=\text{median}\{\tau_{n,1}^{*}(w),\tau_{n,2}^{*}(w),\ldots,\tau_{n,k}^{*}(w)\}$ . Then,

[TABLE]

where the median operation is defined as in Section 2.1.

Under similar conditions, Theorem 2 combined with a generalization of Theorem 1 that handles random $\tau$ (see Theorem 7 in Appendix C.4) establishes that a predictor $\tau_{n}^{*}$ obtained using causal isotonic cross-calibration (Algorithm 2) has calibration error $\text{CAL}(\tau_{n}^{*})$ of order

[TABLE]

as $n\rightarrow\infty$ , where $\mu_{n,s}$ and $\pi_{n,s}$ are the outcome regression and propensity score estimators obtained after excluding the $s^{th}$ fold of the full dataset. In fact, Theorem 2 is valid for any calibrator of the form $\tau_{n}^{*}:w\mapsto\tau_{n,s_{n}(w)}^{*}(w)$ , where $s_{n}(w)$ is any random selector that may depend on the covariate value $w$ . This suggests that the calibration rate for the median-aggregated calibrator implied by Theorem 2 is conservative as it also holds for the worst-case oracle selector that maximizes calibration error.

We now establish that causal isotonic calibration satisfies Property 2, that is, it maintains the predictive accuracy of the initial predictor $\tau$ . In what follows, predictive accuracy is quantified in terms of MSE. At first glance, the calibration-distortion decomposition appears to raise concerns that causal isotonic calibration may distort $\tau$ so much that the predictive accuracy of $\tau_{n}^{*}$ may be worse than that of $\tau$ . This possibility may seem especially concerning given that the output of isotonic regression is a step function, so that there could be many $w,w^{\prime}\in\mathcal{W}$ such that $\tau(w)\not=\tau(w^{\prime})$ but $\tau_{n}^{*}(w)=\tau_{n}^{*}(w^{\prime})$ . The following theorem alleviates this concern by establishing that, up to a remainder term that decays with sample size, the MSE of $\tau_{n}^{*}$ is no larger than the MSE of the initial CATE predictor $\tau$ . A consequence of this theorem is that causal isotonic calibration does not distort $\tau$ so much as to destroy its predictive performance. To derive this result, we leverage that $\tau_{n}^{*}$ is in fact a misspecified DR-learner of the univariate CATE function $\gamma_{0}(\tau,\cdot)$ . While isotonic calibrated predictors are calibrated even when $\gamma_{0}(\tau,\cdot)$ is not a monotone function of $\tau$ , we stress that misspecified DR-learners for $\gamma_{0}(\tau,\cdot)$ are typically uncalibrated.

In the theorem below, we define the best isotonic approximation of the CATE given the initial predictor $\tau$ as

[TABLE]

Theorem 3 (Causal isotonic calibration does not inflate MSE much).

Under Conditions 1–5,

[TABLE]

as $n\rightarrow\infty$ . As such, as $n\rightarrow\infty$ , the inflation in root MSE from causal isotonic calibration satisfies

[TABLE]

A similar MSE bound can be established for causal isotonic cross-calibration as defined in Algorithm 2.

5 Simulation Studies

5.1 Data-Generating Mechanisms

We examined the behavior of our proposal under two data-generating mechanisms. The first mechanism (Scenario 1) includes a binary outcome whose conditional mean is an additive function (on the logit scale) of non-linear transformations of four confounders with treatment interactions. The second mechanism (Scenario 2) includes instead a continuous outcome with conditional mean linear on covariates and treatment interactions, with more than 100 covariates of which only 20 are true confounders. In both scenarios, the propensity score follows a logistic regression model. All covariates were independent and uniformly distributed on $(-1,+1)$ . Sample sizes $1,000$ , $2,000$ , $5,000$ and $10,000$ were considered. Further details are given in Appendix D.1.

5.2 CATE Estimation

We employed the DR-learner algorithm, as outlined by Kennedy (2020), in combination with different supervised learning algorithms to generate uncalibrated predictors of the CATE. In Scenario 1, to estimate the CATE, we implemented gradient-boosted regression trees (GBRT) with maximum depths equal to 2, 5, and 8 (Chen and Guestrin, 2016), random forests (RF) (Breiman, 2001), generalized linear models with lasso regularization (GLMnet) (Friedman et al., 2010), generalized additive models (GAM) (Wood, 2017), and multivariate adaptive regression splines (MARS) (Friedman, 1991). In Scenario 2, we implemented RF, GLMnet, and a combination of variable screening with lasso regularization followed by GBRT with maximum depth determined via cross-validation. We used the implementation of these estimators found in R package sl3 (Coyle et al., 2021). Causal isotonic cross-calibration was implemented using the variant outlined in Algorithm 3. Further details are given in Appendix D.2.

5.3 Performance Metrics

We evaluated the performance of each causal isotonic cross-calibrated predictor relative to its corresponding uncalibrated predictor using three metrics: the calibration measure defined in (1), MSE, and the calibration bias within bins defined by the first and last prediction deciles. The calibration bias within bins is given by the measure in (2) standardized by the probability of falling within each bin. For each simulation iteration, the metric was estimated empirically using an independent sample $\mathcal{V}$ of size $n_{\mathcal{V}}=10^{4}$ . These metric estimates were then averaged across 1000 simulations. Details on these metrics are provided in Appendix D.3.

5.4 Simulation Results

Results from Scenario 1 are summarized in Figure 1(a). The predictors based on GLMnet and GAM happened to be well-calibrated, and so, causal isotonic calibration did not lead to substantial improvements in calibration error. In contrast, causal isotonic calibration of RF, MARS, and GBRT substantially decreased their calibration error, regardless of tree depth and sample size. In terms of MSE, calibration improved the predictive performance of RF, MARS, and GBRT, and preserved the performance of GLMnet and GAM. The calibration bias within bins of prediction was generally smaller after calibration, with a more notable improvement on MARS, RF, and GBRT (see Table 2 in Appendix E).

Results from Scenario 2 are summarized in Figure 1(b). The predictors based on RF and GBRT with GLMnet screening were poorly calibrated, and causal isotonic calibration substantially reduced their calibration error. Calibration did not noticeably change the already small calibration error of the GLMnet predictions; however, calibration substantially reduced the calibration error within quantile bins of its predictions — see Table 3 in Appendix E. Finally, with respect to MSE, causal isotonic calibration improved the performance of RF and GBRT with variable screening, and yielded similar performance to GLMnet.

In Figure 2 of Appendix E, we compared the performance of calibration using hold-out sets with that of cross-calibration and found substantial improvements in MSE and calibration by using cross-calibration.

6 Conclusion

In this work, we proposed causal isotonic calibration as a novel method to calibrate treatment effect predictors. In addition, we established that the pointwise median of calibrated predictors is also calibrated. This allowed us to develop a data-efficient variant of causal isotonic calibration using cross-fitted predictors, thereby avoiding the need for a hold-out calibration dataset. Our proposed methods guarantee that, under minimal assumptions, the calibration error defined in (2) vanishes at a fast rate of $\ell^{-2/3}$ with little or no loss in predictive power, where $\ell$ denotes the number of observations used for calibration. This property holds regardless of how well the initial predictor $\tau$ approximates the true CATE function. To our knowledge, our method is the first in the literature to directly calibrate CATE predictors without requiring trial data or parametric assumptions. Potential applications of our method include data-driven decision-making with strong robustness guarantees. In future work, it would be interesting to study whether pairing causal isotonic cross-calibration with conformal inference (Lei and Candès, 2021) leads to improved ITE prediction intervals, and whether causal isotonic calibration and shape-constrained inference methods (Westling and Carone, 2020) can be used to construct confidence intervals for $\gamma_{0}(\tau_{n}^{*},\cdot)$ .

Our method has limitations. Its calibration guarantees require that either $\mu_{0}$ or $\pi_{0}$ be estimated sufficiently well. Flexible learning methods can be used to satisfy this condition. If $\pi_{0}$ is known, this condition can be trivially met. Hence, our method can be readily used to calibrate CATE predictors and characterize HTEs in clinical trials. For proper calibration, our method requires all confounders to be measured and adjusted for. In future work, it will be important to study CATE calibration in the context of unmeasured confounding. Our strategy could be adapted to construct calibrators for general learning tasks, including E-learning of the conditional relative risk (Jiang et al., 2019; Qiu et al., 2019), proximal causal learning (Tchetgen et al., 2020; Sverdrup and Cui, 2023), and instrumental variable-based learning (Okui et al., 2012; Syrgkanis et al., 2019).

In simulations, we found that causal isotonic cross-calibration led to well-calibrated predictors without sacrificing predictive performance; benefits were especially prominent in high-dimensional settings and for tree-based methods. This is of particularly high relevance given that regression trees have become popular for CATE estimation, due to both their flexibility (Athey and Imbens, 2016) and interpretability (Lee et al., 2020). We also found that cross-calibration substantially improved the MSE of the calibrated predictor relative to hold-out set approaches. In some cases, cross-calibration even improved upon the MSE of the uncalibrated predictor.

Though our focus was on treatment effect estimation, our theoretical arguments can be readily adapted to provide guarantees for isotonic calibration in regression and classification problems. Hence, we have provided an affirmative answer to the open question of whether it is possible to establish distribution-free calibration guarantees for isotonic calibration (Gupta, 2022).

Acknowledgements. Research reported in this publication was supported by NIH grants DP2-LM013340 and R01-HL137808. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Appendix A Implementation of algorithms in R

R code implementing causal isotonic calibration with user-supplied (cross-fitted) nuisance estimates and predictions is provided in the GitHub package causalCalibration and can be found at https://github.com/Larsvanderlaan/causalCalibration.

Appendix B Algorithm for causal isotonic calibration with cross-fitted nuisance estimates

Appendix C Technical proofs

Unless stated otherwise, the function $\tau_{n}^{*}$ denotes a calibrated predictor obtained using Algorithm 1 with a predictor $\tau$ , training dataset $\mathcal{E}_{m}$ , and calibration dataset $\mathcal{C}_{\ell}=\mathcal{D}_{n}\backslash\mathcal{E}_{m}$ as described in Section 4.

C.1 Notation & definitions

Let $\mathcal{T}:=\{\tau(w):w\in\mathcal{W}\}$ denote the range of the predictor $\tau$ , which is a bounded subset of $\mathbb{R}$ by Condition 4. We redefine $\mathcal{F}_{iso}\subset\{\theta:\mathcal{T}\rightarrow\mathbb{R};\theta\text{ is monotone nondecreasing}\}$ to denote the family of nondecreasing functions on $\mathcal{T}$ uniformly bounded by

[TABLE]

where the second supremum is over all possible realizations of the training dataset $\mathcal{E}_{m}$ . We necessarily have that $B$ is nonrandom and finite by Lemma 5. Redefining $\mathcal{F}_{iso}$ to be bounded allows us to directly apply certain maximal inequalities for empirical processes indexed by $\mathcal{F}_{iso}$ . Since the isotonic regression estimator is obtained by locally averaging the pseudo-outcome $\chi_{m}$ (Barlow and Brunk, 1972), the unconstrained isotonic regression solution satisfies this bound and falls in the interior of this class almost surely. Moreover, $\mathcal{F}_{iso}$ is a convex subset of the space of monotone nondecreasing functions. Let $\mathcal{F}_{TV}\subset\{\theta:\mathbb{R}\rightarrow\mathbb{R};\theta\text{ is of bounded variation}\}$ denote the space of functions with total variation uniformly bounded by three times the total variation of the function $\theta_{0}$ where $\theta_{0}$ is as in Condition 5. Additionally, let $\mathcal{F}_{\tau,iso}:=\left\{\theta\circ\tau:\mathcal{W}\rightarrow\mathbb{R};\theta\in\mathcal{F}_{iso}\right\}$ be the family of functions obtained by composing nondecreasing functions in $\mathcal{F}_{iso}$ with $\tau$ , and let $\mathcal{F}_{\tau,TV}:=\{\theta\circ\tau:\mathcal{W}\rightarrow\mathbb{R};\theta\in\mathcal{F}_{TV}\}$ be the family of functions obtained by composing functions in $\mathcal{F}_{TV}$ with $\tau$ . Let $\mathcal{F}_{Lip,m}:=\left\{o\mapsto[\tau_{2}(w)-\tau_{1}(w)][\chi_{m}(o)-\tau_{2}(w)]:\mathcal{O}\rightarrow\mathbb{R};\tau_{2}\in\mathcal{F}_{\tau,TV},\tau_{1}\in\mathcal{F}_{\tau,iso}\right\}$ , where $\chi_{m}$ is the estimated pseudo-outcome function. Finally, for a function class $\mathcal{F}$ , let $N(\epsilon,\mathcal{F},L_{2}(P))$ denote the $\epsilon-$ covering number (van der Vaart and Wellner, 1996) of $\mathcal{F}$ and define the uniform entropy integral of $\mathcal{F}$ by

[TABLE]

where the supremum is taken over all discrete probability distributions $Q$ . In contrast to the definition provided in van der Vaart and Wellner (1996), we do not define the uniform entropy integral relative to an envelope function for the function class $\mathcal{F}$ . We can do this since all function classes we consider are uniformly bounded. Thus, any uniformly bounded envelope function will only change the uniform entropy integral as defined in van der Vaart and Wellner (1996) by a constant.

In the results below, we will use the following empirical process notation: for a $P-$ measurable function $f$ , we denote $\int f(o)dP(o)$ by $Pf$ , and so, letting $P_{\ell}$ denote the empirical distribution of $\mathcal{C}_{\ell}$ , $P_{\ell}f$ equals $\frac{1}{\ell}\sum_{i\in\mathcal{I}_{\ell}}f(O_{i})$ with $\mathcal{I}_{\ell}$ indexing observations of $\mathcal{C}_{\ell}\subset\mathcal{D}_{n}$ . We also let $\left\lVert f\right\rVert_{P}^{2}:=Pf^{2}$ ; to simplify notation, we omit the dependency on $P$ and use $\left\lVert f\right\rVert^{2}$ instead of $\left\lVert f\right\rVert_{P}^{2}$ . Finally, for two quantities $x$ and $y$ , we use the expression $x\lessapprox y$ to mean that $x$ is upper bounded by $y$ times a universal constant that may only depend on global constants that appear in Conditions 1–5.

C.2 Technical lemmas

The following lemma is a key component of our proof of Theorem 1.

Lemma 4.

For a calibrated predictor $\tau_{n}^{*}$ obtained using Algorithm 1, and any real-valued function $r$ , we have that

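$$\sum_{i\in\mathcal{I}_{\ell}}r\circ\tau_{n}^{*}(W_{i})\left[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})\right]=0.$$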

Proof.

Note that $\tau_{n}^{*}(w)$ can be expressed pointwise for any $w\in\mathcal{W}$ as $\theta_{n}^{*}\circ\tau(w)=a_{0}+\sum_{j=1}^{J}a_{j}1(\tau(w)\geq u_{j})$ for a piecewise constant function $\theta_{n}^{*}$ determined by coefficients $\{a_{j}\}_{j=0}^{J}$ and jump points $\{u_{j}\}_{j=1}^{J}$ (Barlow and Brunk, 1972). By monotonicity, we necessarily have $a_{0}\in\mathbb{R}$ and $\{a_{j}\}_{j=1}^{J}$ are positive coefficients.

Let $R_{n}(\theta):=\sum_{i\in\mathcal{I}_{\ell}}[\theta\circ\tau(W_{i})-\chi_{m}(O_{i})]^{2}$ denote the least-squares risk used in the isotonic regression step. Fix an arbitrary jump point $\bar{u}_{j}$ , and let $\xi_{n}:\mathbb{R}^{2}\rightarrow\mathbb{R}$ denote the function $\xi_{n}(\varepsilon,h):=\theta_{n}^{*}(h)+\varepsilon 1(h\geq\bar{u}_{j})$ . Note that $\delta>0$ can be chosen to be sufficiently small that, for all $|\varepsilon|\leq\delta$ , $h\mapsto\xi_{n}(\varepsilon,h)$ is nondecreasing; for instance, $\delta=\min\{a_{j}\}_{j=1}^{J}$ suffices. Thus, for sufficiently small $\delta>0$ , $h\mapsto\xi_{n}(\varepsilon,h)$ lies in the space of monotone nondecreasing functions for all $|\varepsilon|\leq\delta$ . In a slight abuse of notation, we let $R_{n}(\xi_{n}(\varepsilon)):=\sum_{i\in\mathcal{I}_{\ell}}[\xi_{n}(\varepsilon,\tau(W_{i}))-\chi_{m}(O_{i})]^{2}$ and $R_{n}(\xi_{n}(-\varepsilon)):=\sum_{i\in\mathcal{I}_{\ell}}[\xi_{n}(-\varepsilon,\tau(W_{i}))-\chi_{m}(O_{i})]^{2}$ .

Now, because $\theta_{n}^{*}$ minimizes $\theta\mapsto R_{n}(\theta)$ over the space of monotone nondecreasing functions, for all $\varepsilon\geq 0$ , it holds that both $R_{n}(\xi_{n}(\varepsilon))-R_{n}(\tau_{n}^{*})\geq 0$ and $R_{n}(\xi_{n}(-\varepsilon))-R_{n}(\tau_{n}^{*})\geq 0$ . Moreover, when $\varepsilon=0$ , $R_{n}(\xi_{n}(0))-R_{n}(\tau_{n}^{*})=0$ . Therefore, if $\varepsilon$ is sufficiently close to 0, the derivative with respect to $\varepsilon$ of $R_{n}(\xi_{n}(\varepsilon))-R_{n}(\tau_{n}^{*})$ must be non-negative, and $R_{n}(\xi_{n}(-\varepsilon))-R_{n}(\tau_{n}^{*})$ must be non-positive. Hence, it must be true that

[TABLE]

This, in turn, implies that

[TABLE]

and so, it follows that $\sum_{i\in\mathcal{I}_{\ell}}1(\tau(W_{i})\geq\bar{u}_{j})\left[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})\right]=0$ . Because the jump point $\bar{u}_{j}$ was arbitrary, for every function of the form $s(w)=b_{0}+\sum_{j=1}^{J}b_{j}1(\tau(w)\geq u_{j})$ with coefficients $\{b_{j}\}_{j=0}^{J}$ , we can show that

$$\sum_{i\in\mathcal{I}_{\ell}}s(W_{i})\left[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})\right]=0$$

by taking linear combinations of $1(\tau(w)\geq u_{j})$ and noting that the score equations are linear in $s$ . The main result of this lemma follows from the fact that, since both $\tau_{n}^{*}$ and $r\circ\tau_{n}^{*}$ can be expressed in this form, for any real-valued function $r$ , we have that

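$$\sum_{i\in\mathcal{I}_{\ell}}r\circ\tau_{n}^{*}(W_{i})\left[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})\right]=0.$$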

∎
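The score equations of Lemma 4 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it fits a least-squares nondecreasing function with a hand-written pool-adjacent-violators routine (the helper names `pava` and `score`, the simulated predictor values, and the stand-in pseudo-outcomes are all assumptions made for illustration) and evaluates $\sum_{i}r(\tau_{n}^{*}(W_{i}))[\tau_{n}^{*}(W_{i})-\chi_{m}(O_{i})]$ for a few choices of $r$ .

```python
import math
import random


def pava(y):
    """Least-squares nondecreasing fit to y, with y ordered by the predictor value."""
    blocks = []  # each block stores [sum of y, number of points]
    for value in y:
        blocks.append([value, 1])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


random.seed(0)
n = 200
tau = sorted(random.uniform(-1, 1) for _ in range(n))  # stand-in for tau(W_i), sorted
chi = [t ** 3 + random.gauss(0, 0.5) for t in tau]      # stand-in for chi_m(O_i)
fitted = pava(chi)                                      # plays the role of tau_n^*(W_i)


def score(r):
    """Empirical score: sum_i r(tau_n^*(W_i)) * [tau_n^*(W_i) - chi_m(O_i)]."""
    return sum(r(f) * (f - c) for f, c in zip(fitted, chi))


for name, r in [("constant", lambda t: 1.0), ("identity", lambda t: t), ("sin", math.sin)]:
    print(name, score(r))  # each value vanishes up to floating-point error
```

The scores vanish because the fit is constant on each pooled block and equals the block mean of the pseudo-outcomes, so any function of the fitted value is orthogonal to the residuals block by block.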

Lemma 5.

Conditions 1, 2 and 4 imply that the function classes $\mathcal{F}_{iso}$ , $\mathcal{F}_{\tau,TV}$ , $\mathcal{F}_{\tau,iso}$ and $\mathcal{F}_{Lip,m}$ are bounded.

Proof.

By Conditions 1, 2 and 4, we know that $\chi_{m}(o)$ is bounded uniformly over all observations $o\in\mathcal{O}$ and realizations of $\mathcal{E}_{m}$ , that is, there exists a finite fixed constant $B$ such that $\operatorname*{ess\,sup}_{m\in\mathbb{N},o\in\mathcal{O}}|\chi_{m}(o)|\leq B/2$ . Hence, as defined in the previous section, $\mathcal{F}_{iso}$ is uniformly bounded. Moreover, because $\mathcal{F}_{iso}$ is bounded, it follows directly that $\mathcal{F}_{\tau,iso}$ is bounded. Noting that functions of finite variation are bounded, in view of Condition 5, we have that $\mathcal{F}_{TV}$ is uniformly bounded by some constant that depends neither on $\theta$ nor $\tau$ . This implies that $\mathcal{F}_{\tau,TV}$ is uniformly bounded. Finally, because $\mathcal{F}_{\tau,TV}$ , $\mathcal{F}_{\tau,iso}$ , $\chi_{m}$ and the potential outcomes are uniformly bounded, the function class $\mathcal{F}_{Lip,m}$ is also uniformly bounded. ∎

Lemma 6.

Under Condition 5 and the conditions of Lemma 5, the function $\tau^{\prime}\mapsto E[Y_{1}-Y_{0}\,|\,\tau_{n}^{*}(W)=\tau^{\prime}]$ has total variation bounded above by three times the total variation of $\theta_{0}$ , where $\theta_{0}$ is as in Condition 5.

Proof.

Since the function $\theta_{n}^{*}$ is nondecreasing and piecewise constant, we have

[TABLE]

for the set $B_{\tau^{\prime}}:=\left\{z\in\mathcal{T}:\theta_{n}^{*}(z)=\tau^{\prime}\right\}$ , where $B_{\tau^{\prime}}=\{z\in\mathcal{T}:a(\tau^{\prime})\leq z<b(\tau^{\prime})\}$ for some endpoints $a(\tau^{\prime}),b(\tau^{\prime})\in\mathbb{R}$ . The law of total expectation further implies that

[TABLE]

where $\theta_{0}$ is such that $\theta_{0}\circ\tau(W)=\gamma_{0}(\tau,W)$ $P$ -almost surely. By Condition 5, the function $\theta_{0}$ is of bounded total variation. Heuristically, since $\tau^{\prime}\mapsto E[\theta_{0}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ is obtained by locally averaging $\theta_{0}$ within the bins $(B_{\tau^{\prime}}:\tau^{\prime})$ , its total variation should also be bounded. We show this formally as follows. Note first that

[TABLE]

where $\theta_{0}^{+}$ and $\theta_{0}^{-}$ are two bounded, nondecreasing functions satisfying the Jordan decomposition $\theta_{0}=\theta_{0}^{+}-\theta_{0}^{-}$ (Theorem 4, Section 5.2 of Royden, 1963). Moreover, we can choose $\theta_{0}^{+}$ such that $\theta_{0}^{+}(\infty)-\theta_{0}^{+}(-\infty)$ is equal to the total variation of $\theta_{0}$ . Since $\|\theta_{0}^{-}\|_{TV}=\|\theta_{0}-\theta_{0}^{+}\|_{TV}\leq\|\theta_{0}\|_{TV}+\|\theta_{0}^{+}\|_{TV}$ , we have that $\|\theta_{0}^{-}\|_{TV}$ is bounded by $2\|\theta_{0}\|_{TV}$ .

Since $\theta_{n}^{*}$ is nondecreasing, by definition, we have that $t_{1}<t_{2}$ implies that $x_{1}<x_{2}$ for any $x_{1}\in B_{t_{1}}$ and $x_{2}\in B_{t_{2}}$ . It follows that both $\tau^{\prime}\mapsto E[\theta_{0}^{+}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ and $\tau^{\prime}\mapsto E[\theta_{0}^{-}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ are nondecreasing; furthermore, they are also bounded. By Theorem 4 of Royden (1963), a function is of bounded variation if and only if it is the difference between two bounded nondecreasing functions. We conclude that $\tau^{\prime}\mapsto E[Y_{1}-Y_{0}\,|\,\theta_{n}^{*}\circ\tau(W)=\tau^{\prime}]=E[\theta_{0}^{+}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]-E[\theta_{0}^{-}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ is of bounded variation. Moreover, its total variation norm is bounded above by the sum of the total variation norm of $E[\theta_{0}^{+}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ and that of $E[\theta_{0}^{-}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ . We recall that the total variation of a monotone function is simply the absolute difference between its values at the two endpoints of its domain, and that

[TABLE]

and similarly for $\theta_{0}^{-}\circ\tau$ . As a consequence, the total variation norms of $E[\theta_{0}^{+}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ and $E[\theta_{0}^{-}\circ\tau(W)\,|\,\tau(W)\in B_{\tau^{\prime}}]$ are bounded by the total variation norm of $\theta_{0}^{+}$ and that of $\theta_{0}^{-}$ , respectively. Using the sublinearity of the total variation norm, we conclude that $\tau^{\prime}\mapsto E[Y_{1}-Y_{0}\,|\,\theta_{n}^{*}\circ\tau(W)=\tau^{\prime}]$ has total variation norm bounded above by $3\|\theta_{0}\|_{TV}$ . ∎
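The heuristic behind Lemma 6, that averaging a function within consecutive bins does not inflate its total variation, can be illustrated on a grid. This is a sketch under simplifying assumptions: $\tau(W)$ is taken to be uniformly distributed on an equally spaced grid, the function `theta0` and the bin edges are arbitrary illustrative choices, and `total_variation` is a hypothetical helper that computes the discrete total variation of a sequence.

```python
import math


def total_variation(values):
    """Discrete total variation: sum of absolute successive differences."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# theta_0 evaluated on an equally weighted grid of predictor values in [0, 1].
grid = [i / 199 for i in range(200)]
theta0 = [math.sin(12 * z) + 0.5 * z for z in grid]

# Consecutive index ranges standing in for the level sets B_{tau'} of a
# nondecreasing step function theta_n^*.
edges = [0, 15, 40, 90, 91, 150, 200]
bin_means = [sum(theta0[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

tv_theta0 = total_variation(theta0)
tv_binned = total_variation(bin_means)
print(tv_binned, tv_theta0)
```

In this discrete setting the binned averages never have larger total variation than the original sequence; the factor of three in Lemma 6 arises from the Jordan decomposition used in the formal argument.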

C.3 Proofs of theorems

Proof of Theorem 1

Proof.

Conditioning on $\mathcal{D}_{n}$ , we have that

[TABLE]

The above equality implies that

[TABLE]

Note that, by Lemma 4, for each real-valued function $r$ , $\tau_{n}^{*}$ satisfies the equation

[TABLE]

Setting $r(\tau^{\prime}):=E[Y_{1}-Y_{0}\,|\,\tau_{n}^{*}(W)=\tau^{\prime}]-\tau^{\prime}$ , we conclude that

[TABLE]

Subtracting the above score equation from the second summand in (6), we obtain that

[TABLE]

This may be written in shorthand as $\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert^{2}=(I)+(II)$ with

[TABLE]

In order to show the desired result, we will bound both $(I)$ and $(II)$ .

We can bound $(I)$ using the law of iterated conditional expectations and the Cauchy-Schwarz inequality. First, conditioning on $\mathcal{E}_{m}$ , we note that

[TABLE]

Next, we express the second norm in (C.3) in terms of $\left\lVert\pi_{m}-\pi_{0}\right\rVert$ and $\left\lVert\mu_{m}-\mu_{0}\right\rVert$ . Recalling that $E[\chi_{0}(O)\,|\,W=w]=\tau_{0}(w)$ , we have that

[TABLE]

By Condition 2, $P(1-\eta>\pi_{m}(W)>\eta)=1$ for some $\eta>0$ . The latter condition combined with the Cauchy-Schwarz inequality gives that $\left\lVert E[\chi_{0}(O)\,|\,W=\,\cdot\,]-E[\chi_{m}(O)\,|\,W=\,\cdot\,,\mathcal{E}_{m}]\right\rVert$ is bounded above by

[TABLE]

By Condition 2, we also have that for any $P$ -measurable function $h:\mathcal{W}\rightarrow\mathbb{R}$

[TABLE]

The same bound holds for $\int h(w)^{2}[\mu_{0}(0,w)-\mu_{m}(0,w)]^{2}dP(w)$ . Setting $h:w\mapsto\pi_{m}(w)-\pi_{0}(w)$ , we conclude

[TABLE]

Together, (C.3) and the preceding display yield that $(I)$ is bounded above by

[TABLE]

We now find an upper bound for $(II)$ . We claim that, conditionally on $\mathcal{E}_{m}$ , the random functions appearing in this empirical process term are contained in fixed and uniformly bounded function classes. To see this, we note that $\tau_{n}^{*}=\theta_{n}^{*}\circ\tau$ for some $\theta_{n}^{*}\in\mathcal{F}_{iso}$ and, as a consequence, $\tau_{n}^{*}\in\mathcal{F}_{\tau,iso}$ , a uniformly bounded function class by Lemma 5, $P_{0}$ -almost surely. By Lemma 6, the function $w\mapsto\gamma_{0}(\tau_{n}^{*},w)$ falls in $\mathcal{F}_{\tau,TV}$ . This further implies that $o\mapsto\{E[Y_{1}-Y_{0}\,|\,\tau_{n}^{*}(W)=\tau_{n}^{*}(w)]-\tau_{n}^{*}(w)\}\{\chi_{m}(o)-\tau_{n}^{*}(w)\}\in\mathcal{F}_{Lip,m}$ , which is a uniformly bounded function class by Lemma 5.

Next, we let $C:=\operatorname*{ess\,sup}_{x\in\mathcal{T}}|\theta_{0}(x)|$ and define $K:=B+C$ , where we recall that $B:=\sup_{m\in\mathbb{N}}\sup_{\mathcal{E}_{m}}\operatorname*{ess\,sup}_{o\in\mathcal{O}}\left\{|\chi_{0}(o)|+|\chi_{m}(o)|\right\}$ . Furthermore, we set $\delta_{n}:=\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert$ , which is a random rate. For any given rate $\delta$ , we define

[TABLE]

As a consequence of the above, we have that $(II)\leq S_{n}(\delta_{n})$ . Due to the randomness in $\delta_{n}$ , this bound cannot be further upper-bounded immediately. Instead, we take a $\delta>0$ that is deterministic conditional on $\mathcal{E}_{m}$ , and upper-bound $\phi_{n}(\delta):=E\left\{S_{n}(\delta)\right\}$ , where the expectation is also taken over $\mathcal{D}_{n}$ . To do so, we use empirical process techniques with the function classes $\mathcal{F}_{iso}$ , $\mathcal{F}_{\tau,TV}$ , $\mathcal{F}_{\tau,iso}$ and $\mathcal{F}_{Lip,m}$ , which requires studying the uniform entropy integral

[TABLE]

for each of these function classes. By Lemma 5, all these function classes are uniformly bounded. We note that, conditional on $\mathcal{E}_{m}$ so that $\chi_{m}$ is fixed, $\mathcal{F}_{Lip,m}$ is a multivariate Lipschitz transformation of $\mathcal{F}_{\tau,TV}$ and $\mathcal{F}_{\tau,iso}$ , and therefore, by Theorem 2.10.20 of van der Vaart and Wellner (1996), we have that $\mathcal{J}(\delta,\mathcal{F}_{Lip,m})\lessapprox\mathcal{J}(\delta,\mathcal{F}_{\tau,TV})+\mathcal{J}(\delta,\mathcal{F}_{\tau,iso}).$ Since functions of bounded total variation can be written as a difference of nondecreasing monotone functions, we have by the same theorem that $\mathcal{J}(\delta,\mathcal{F}_{TV})\lessapprox\mathcal{J}(\delta,\mathcal{F}_{iso}).$ We claim the same upper bound holds up to a constant for $\mathcal{F}_{\tau,TV}$ and $\mathcal{F}_{\tau,iso}$ . We establish this explicitly for $\mathcal{F}_{\tau,iso}$ below; the result for $\mathcal{F}_{\tau,TV}$ follows from an identical argument. We note that

[TABLE]

where $Q\circ\tau^{-1}$ is the push-forward probability measure for the random variable $\tau(W)$ . We now proceed with bounding $\phi_{n}(\delta)$ . Applying Theorem 2.10.20 of van der Vaart and Wellner (1996), we obtain, for any $\delta>0$ deterministic conditionally on $\mathcal{E}_{m}$ , that

[TABLE]

where the right-hand side can only be random through $\delta$ .

We can now proceed with the main argument that gives a rate of convergence for $\delta_{n}$ . First, we note that combining Equations 7 and 10 yields that the event

[TABLE]

occurs with probability one. We then proceed with a peeling argument to account for the randomness of $\delta_{n}$ . Let $\varepsilon_{n}$ be any given sequence that is deterministic conditional on $\mathcal{E}_{m}$ , and define $A_{s}$ as the event $\left\{2^{s+1}\varepsilon_{n}\geq\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert\geq 2^{s}\varepsilon_{n}\right\}$ as well as the random quantity $\epsilon_{m}^{nuis}:=\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert$ . Then, for any $S>0$ , we have that

[TABLE]

In all the events in the above sum, we have that $S_{n}(\delta_{n})\leq S_{n}(2^{s+1}\varepsilon_{n})$ since $\delta_{n}=\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert$ . Next, manipulating the inequalities in the above events, we have that

[TABLE]

which implies that the sum in (C.3) is upper bounded by

[TABLE]

Using (C.3) and Markov’s inequality, we find that

[TABLE]

As a consequence of Lemma 5 and the covering number bound for bounded monotone functions given in Theorem 2.7.5 of van der Vaart and Wellner (1996), we have that $\mathcal{J}(2^{s+1}\varepsilon_{n},\mathcal{F}_{iso})\lessapprox 2^{s/2+1/2}\sqrt{\varepsilon_{n}}$ . Using this fact, we find that

[TABLE]

from which it follows that

[TABLE]

We now choose $\varepsilon_{n}:=\max\{\ell^{-1/3},\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert\}$ , which indeed is deterministic conditional on $\mathcal{E}_{m}$ . This choice ensures that $\mathcal{J}(\varepsilon_{n},\mathcal{F}_{iso})\lessapprox\sqrt{\ell}\varepsilon_{n}^{2}$ and $\epsilon_{m}^{nuis}=\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert\lessapprox\varepsilon_{n}$ , so that

[TABLE]

where the right-hand side is nonrandom. Thus, we have that

[TABLE]

As a consequence, for every $\varepsilon>0$ , we can find a constant $2^{S}$ sufficiently large such that $P\left(\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert\geq 2^{S}\varepsilon_{n}\right)<\varepsilon$ . In other words, we have shown that $\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert=O_{P}(\varepsilon_{n})$ for our choice of $\varepsilon_{n}$ , and so, $CAL(\tau_{n}^{*})=\left\lVert\gamma_{0}(\tau_{n}^{*},\cdot)-\tau_{n}^{*}\right\rVert^{2}=O_{P}(\varepsilon_{n}^{2})$ . The result follows from the fact that $\varepsilon_{n}^{2}\leq\ell^{-2/3}+\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert^{2}$ . ∎
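The choice of $\varepsilon_{n}$ in the proof of Theorem 1 can be checked by a short calculation, sketched here under the assumption that the entropy integral satisfies $\mathcal{J}(\delta,\mathcal{F}_{iso})\lessapprox\sqrt{\delta}$ , as used in that proof, and writing $\epsilon_{m}^{nuis}=\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert$ :

$$
\sqrt{\varepsilon_{n}}\leq\sqrt{\ell}\,\varepsilon_{n}^{2}\iff\varepsilon_{n}^{3/2}\geq\ell^{-1/2}\iff\varepsilon_{n}\geq\ell^{-1/3},
\qquad
\varepsilon_{n}^{2}=\max\left\{\ell^{-2/3},(\epsilon_{m}^{nuis})^{2}\right\}\leq\ell^{-2/3}+(\epsilon_{m}^{nuis})^{2}.
$$

The first chain shows why $\ell^{-1/3}$ is the smallest rate at which the entropy term is dominated by $\sqrt{\ell}\varepsilon_{n}^{2}$ ; the second gives the final bound on $\varepsilon_{n}^{2}$ .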

Proof of Theorem 2

Proof.

By the definition of the pointwise median stated in Section 2.1, for each covariate value $w\in\mathcal{W}$ , there exists some random index $j_{n}(w)$ such that $\tau_{n}^{*}(w)=\tau_{n,j_{n}(w)}^{*}(w)$ . (We note here that this property may fail for other definitions of the median when $k$ is even.) Thus, we have that $|\gamma_{0}(\tau_{n}^{*},w)-\tau_{n}^{*}(w)|=|\gamma_{0}(\tau_{n,j_{n}(w)}^{*},w)-\tau_{n,j_{n}(w)}^{*}(w)|\leq\sum_{s=1}^{k}|\gamma_{0}(\tau_{n,s}^{*},w)-\tau_{n,s}^{*}(w)|$ , and so,

[TABLE]

where the final inequality follows from the Cauchy-Schwarz inequality. Squaring both sides gives that $\text{CAL}(\tau_{n}^{*})\leq k\sum_{s=1}^{k}\text{CAL}(\tau_{n,s}^{*})$ , as desired. ∎
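The two ingredients of this proof can be illustrated numerically. The sketch below assumes, for illustration only, that the pointwise median is the lower median (the definition in Section 2.1 is not reproduced in this section), and uses simulated numbers as hypothetical fold-specific predictions; it relies on the Python standard library functions `statistics.median_low` and `statistics.median`.

```python
import random
import statistics

random.seed(1)
k = 4  # an even number of fold-specific calibrated predictors

# Hypothetical values tau*_{n,s}(w), s = 1..k, at five covariate values w.
predictions = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(5)]

for row in predictions:
    low = statistics.median_low(row)  # an order statistic, so one of the k values
    mid = statistics.median(row)      # averages the two middle values when k is even
    print(low in row, mid in row)

# Cauchy-Schwarz step: (sum_s e_s)^2 <= k * sum_s e_s^2 for nonnegative errors e_s.
errors = [abs(random.gauss(0.0, 1.0)) for _ in range(k)]
lhs = sum(errors) ** 2
rhs = k * sum(e * e for e in errors)
print(lhs <= rhs)
```

A median defined as an order statistic always coincides with one of the $k$ predictions, which is the property the proof uses; the averaged median generally does not when $k$ is even.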

Proof of Theorem 3

Proof.

As before, we may write $\tau_{n}^{*}=\theta_{n}^{*}\circ\tau$ for some $\theta_{n}^{*}\in\mathcal{F}_{iso}$ that minimizes the empirical risk

[TABLE]

over $\mathcal{F}_{iso}$ . For any given $\theta\in\mathcal{F}_{iso}$ , the one-sided path $\{\varepsilon\mapsto\theta_{n}^{*}+\varepsilon(\theta-\theta_{n}^{*}):\varepsilon\in[0,1]\}$ through $\theta_{n}^{*}$ lies entirely in $\mathcal{F}_{iso}$ since $\mathcal{F}_{iso}$ is a convex space. Furthermore, we have that

[TABLE]

for all $\theta\in\mathcal{F}_{iso}$ . The oracle isotonic risk minimizer $\tau_{0}^{*}$ can be expressed as $\tau_{0}^{*}=\theta_{0}\circ\tau$ where $\theta_{0}:=\operatorname*{argmin}_{\theta\in\mathcal{F}_{iso}}\left\lVert\theta\circ\tau-\tau_{0}\right\rVert$ . Taking $\theta=\theta_{0}$ in (13), we obtain the inequality

[TABLE]

Rearranging terms and adding and subtracting $P_{\ell}\{[(\theta_{0}-\theta_{n}^{*})\circ\tau](\chi_{0})\}$ in the above inequality implies that $P_{\ell}\{[(\theta_{0}-\theta_{n}^{*})\circ\tau](\chi_{m}-\chi_{0})\}\leq P_{\ell}\{[(\theta_{0}-\theta_{n}^{*})\circ\tau](\theta_{n}^{*}\circ\tau-\chi_{0})\}$ . Adding and subtracting $P\{[(\theta_{0}-\theta_{n}^{*})\circ\tau](\theta_{n}^{*}\circ\tau-\chi_{0})\}$ yields that

[TABLE]

Next, adding and subtracting $P\{(\theta_{0}\circ\tau)[(\theta_{0}-\theta_{n}^{*})\circ\tau]\}$ , we have that

[TABLE]

where we used the fact that $E[\chi_{0}(O)\,|\,W=w]=\tau_{0}(w)$ . Next, we note that $\theta_{0}$ minimizes the population risk function $\theta\mapsto E_{P}[\tau_{0}(W)-\theta\circ\tau(W)]^{2}$ over $\mathcal{F}_{iso}$ . As a consequence, the same argument used to derive (14) can be used to obtain that $P\{[(\theta-\theta_{0})\circ\tau](\tau_{0}-\theta_{0}\circ\tau)\}\leq 0$ for any $\theta\in\mathcal{F}_{iso}$ . Taking $\theta=\theta_{n}^{*}$ , we find that

[TABLE]

Combining (C.3) and (17), we obtain that

[TABLE]

Finally, combining (C.3) and (18), we obtain the following inequality

[TABLE]

Adding and subtracting $P\{[(\theta_{0}-\theta_{n}^{*})\circ\tau](\chi_{m}-\chi_{0})\}$ and noting that $\tau_{0}^{*}-\tau_{n}^{*}=(\theta_{0}-\theta_{n}^{*})\circ\tau$ , we finally obtain the key inequality

[TABLE]

The above is similar to (7) in the proof of Theorem 1, and a similar proof technique is used to establish a convergence rate for $\tau_{n}^{*}$ . Specifically, we use the Cauchy-Schwarz inequality to bound the first term on the right-hand side of (C.3) in terms of $\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert$ , and empirical process techniques to bound the remaining terms in terms of a function of $\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert$ with high probability. Using a similar approach as for the derivation of (10), we can upper-bound the first term of the right-hand side of (C.3) as $P[(\tau_{0}^{*}-\tau_{n}^{*})(\chi_{0}-\chi_{m})]\leq\|\tau_{0}^{*}-\tau_{n}^{*}\|\|(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\|$ . The second term in the right-hand side of (C.3) can be examined as follows. We let $\mathcal{F}_{4,m}:=\{(\tau_{1}-\tau_{2})(\chi_{m}-\chi_{0}):\tau_{1},\tau_{2}\in\mathcal{F}_{\tau,iso}\}$ , and define $Q:=\sup_{o\in\mathcal{O}}\chi_{0}(o)$ , which is finite in view of Conditions 1 and 2. Additionally, we let $R:=Q+B$ , and define for any fixed $\delta>0$

[TABLE]

Letting $\delta_{1,n}:=\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert$ , we have that $(P-P_{\ell})[(\tau_{0}^{*}-\tau_{n}^{*})(\chi_{m}-\chi_{0})]\leq Z_{1,n}(\delta_{1,n})$ . We note that $\mathcal{F}_{4,m}$ is a Lipschitz transformation of the function class $\mathcal{F}_{\tau,iso}$ , and so, for every $\delta>0$ that is deterministic conditional on $\mathcal{E}_{m}$ , we have that

[TABLE]

in view of Theorem 2.10.20 of van der Vaart and Wellner (1996) and the results outlined in Theorem 1, where the right-hand side can only be random through $\delta$ . Finally, the third term in (C.3) can be studied as follows. We let $\mathcal{F}_{5}:=\{(\tau_{1}-\tau_{2})(\tau_{2}-\chi_{0}):\tau_{1},\tau_{2}\in\mathcal{F}_{\tau,iso}\}$ , and for any given $\delta>0$ , we define

[TABLE]

with $G:=Q+B$ . We note that $\mathcal{F}_{5}$ is a Lipschitz transformation of $\mathcal{F}_{\tau,iso}$ . Hence, similarly as above, for any $\delta>0$ that is nonrandom conditional on $\mathcal{E}_{m}$ , we have that

[TABLE]

where the right-hand side can only be random through $\delta$ . Defining $\epsilon_{m}^{nuis}:=\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert$ , by a similar peeling argument as in Theorem 1, for any rate $\varepsilon_{n}$ that is nonrandom conditional on $\mathcal{E}_{m}$ , we can show that

[TABLE]

Then, by the same arguments used in Theorem 1 and the same choice of $\mathcal{E}_{m}$ -random $\varepsilon_{n}$ , we can establish that $\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert=O_{P}(\ell^{-1/3})+O_{P}(\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert)$ . By the triangle inequality and the fact that $\tau_{0}^{*}=\operatorname*{argmin}_{\theta\circ\tau:\theta\in\mathcal{F}_{iso}}\left\lVert\tau_{0}-\theta\circ\tau\right\rVert$ implies $\left\lVert\tau_{0}-\tau_{0}^{*}\right\rVert\leq\left\lVert\tau_{0}-\tau\right\rVert$ , we find that $\left\lVert\tau_{0}-\tau_{n}^{*}\right\rVert\leq\left\lVert\tau_{0}-\tau_{0}^{*}\right\rVert+\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert\leq\left\lVert\tau_{0}-\tau\right\rVert+\left\lVert\tau_{0}^{*}-\tau_{n}^{*}\right\rVert$ . Combining these bounds, we find that $\left\lVert\tau_{0}-\tau_{n}^{*}\right\rVert\leq\left\lVert\tau_{0}-\tau\right\rVert+O_{P}(\ell^{-1/3})+O_{P}(\left\lVert(\pi_{m}-\pi_{0})(\mu_{m}-\mu_{0})\right\rVert)$ . ∎

C.4 Statement and proof of generalized Theorem 1 for random predictor

Here, we consider the same setup as Theorem 1 but allow $\tau_{n}^{*}$ to be obtained from a random predictor $\tau_{m}$ , as long as $\tau_{m}$ is built using only data in $\mathcal{E}_{m}$ .

Condition 6 (independence of predictor).

The predictor $w\mapsto\tau_{m}(w)$ is independent of $\mathcal{C}_{\ell}$ .

Theorem 7 (Calibration with random predictors).

Provided Conditions 1–6 hold, we have that

[TABLE]

Proof.

Arguing exactly as in Theorem 1 with $\tau$ taken to be $\tau_{m}$ and conditioning on $\mathcal{E}_{m}$ as needed, we obtain the basic inequality stating that

[TABLE]

$P$ -almost surely, where $\tau_{n}^{*}:=\theta_{n}^{*}\circ\tau_{m}$ . To establish the result of the theorem, we only need to make minor modifications to the proof of Theorem 1 to allow $\tau$ to be replaced by $\tau_{m}$ . We sketch those modifications here. A core component of the proof of Theorem 1 involved upper-bounding $E[S_{n}(\delta)\,|\,\mathcal{E}_{m}]$ ; this must now be done with $S_{n}(\delta)$ defined as

[TABLE]

with $\tau_{m}$ now a random predictor. Previously, we showed that $E[S_{n}(\delta)\,|\,\mathcal{E}_{m}]$ can be bounded by a nonrandom constant depending on $n$ , $m$ and $\delta$ that is independent of $\mathcal{E}_{m}$ . To do so, we showed that the random function class $\mathcal{F}_{Lip,m}$ is fixed conditional on $\mathcal{E}_{m}$ , uniformly bounded, and has uniform entropy integral bounded by the uniform entropy integral of $\mathcal{F}_{iso}$ . It suffices to show that this remains true when $\tau$ is replaced by $\tau_{m}$ . Since $\tau_{m}$ is obtained from $\mathcal{E}_{m}$ , as with $\chi_{m}$ , the predictor $\tau_{m}$ is deterministic conditionally on $\mathcal{E}_{m}$ . As a consequence, the function classes $\mathcal{F}_{\tau_{m},TV}$ and $\mathcal{F}_{\tau_{m},iso}$ , which are now random through $\tau_{m}$ , are fixed conditional on $\mathcal{E}_{m}$ . Since $\mathcal{F}_{Lip,m}$ is obtained from a Lipschitz transformation of elements of $\mathcal{F}_{\tau_{m},TV}$ and $\mathcal{F}_{\tau_{m},iso}$ , we have that $\mathcal{F}_{Lip,m}$ is also fixed conditional on $\mathcal{E}_{m}$ . Moreover, by the same argument as in the proof of Lemma 5, which also holds for random $\tau$ , these function classes are uniformly bounded by a nonrandom constant almost surely. Finally, the preservation of the uniform entropy integral argument of the proof of Theorem 1 is valid with $\tau$ random. With these modifications to the proof of Theorem 1, the result follows. ∎

Appendix D Simulation studies

D.1 Data-generating mechanisms

In simulation studies, data units were generated as follows for the two scenarios considered.

Scenario 1:

1. generate $W_{1},W_{2},\ldots,W_{4}$ independently from the uniform distribution on $(-1,+1)$ ;

2. given $(W_{1},W_{2},W_{3},W_{4})=(w_{1},w_{2},w_{3},w_{4})$ , generate $A$ as a Bernoulli random variable with success probability $\pi_{0}(w_{1},w_{2},w_{3},w_{4}):=\text{expit}\{-0.25-w_{1}+0.5w_{2}-w_{3}+0.5w_{4}\}$ ;

3. given $(W_{1},W_{2},W_{3},W_{4})=(w_{1},w_{2},w_{3},w_{4})$ and $A=a$ , generate $Y$ as a Bernoulli random variable with success probability $\mu_{0}(a,w_{1},w_{2},\ldots,w_{4}):=\text{expit}\{1.5+1.5a+2a|w_{1}||w_{2}|-2.5(1-a)|w_{2}|w_{3}+2.5w_{3}+2.5(1-a)\sqrt{|w_{4}|}-1.5aI(w_{2}<0.5)+1.5(1-a)I(w_{4}<0)\}$ .
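The three steps of Scenario 1 can be sketched as a small generator. This is an illustrative transcription of the formulas above using only the Python standard library, not the authors' simulation code; the function names `expit`, `pi0`, `mu0`, `cate` and `draw_unit` are hypothetical.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def pi0(w):
    """Propensity score of Scenario 1."""
    w1, w2, w3, w4 = w
    return expit(-0.25 - w1 + 0.5 * w2 - w3 + 0.5 * w4)


def mu0(a, w):
    """Outcome regression of Scenario 1 (success probability of Y)."""
    w1, w2, w3, w4 = w
    return expit(
        1.5 + 1.5 * a
        + 2 * a * abs(w1) * abs(w2)
        - 2.5 * (1 - a) * abs(w2) * w3
        + 2.5 * w3
        + 2.5 * (1 - a) * math.sqrt(abs(w4))
        - 1.5 * a * (w2 < 0.5)        # indicator I(w2 < 0.5)
        + 1.5 * (1 - a) * (w4 < 0)    # indicator I(w4 < 0)
    )


def cate(w):
    """True conditional average treatment effect tau_0(w)."""
    return mu0(1, w) - mu0(0, w)


def draw_unit(rng):
    """Draw one data unit (W, A, Y) following steps 1 to 3."""
    w = [rng.uniform(-1, 1) for _ in range(4)]
    a = int(rng.random() < pi0(w))
    y = int(rng.random() < mu0(a, w))
    return w, a, y


rng = random.Random(0)
sample = [draw_unit(rng) for _ in range(1000)]
print(sum(a for _, a, _ in sample) / len(sample))  # empirical treatment frequency
```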

Scenario 2:

• generate $W_{1},W_{2},\ldots,W_{20}$ independently from the uniform distribution on $(-1,+1)$ ;

• given $(W_{1},W_{2},\ldots,W_{20})=(w_{1},w_{2},\ldots,w_{20})$ , generate $A$ as a Bernoulli random variable with success probability $\pi_{0}(w_{1},w_{2},\ldots,w_{20}):=\text{expit}\{0.2-0.5w_{1}-0.5w_{2}-0.5w_{3}+0.5w_{4}-0.5w_{5}+0.5w_{6}-0.5w_{7}-0.5w_{8}-0.5w_{9}-0.2w_{10}+0.5w_{11}-w_{12}+w_{13}-1.5w_{14}+w_{15}-w_{16}+2w_{17}-w_{18}+1.5w_{19}-w_{20}\}$ ;

• given $(W_{1},W_{2},\ldots,W_{20})=(w_{1},w_{2},\ldots,w_{20})$ and $A=a$ , generate $Y$ as a normal random variable with mean $\mu_{0}(a,w_{1},w_{2},\ldots,w_{20})=-0.5+3.5a+3aw_{1}+6.5(1-a)w_{2}+1.5aw_{3}+4(1-a)w_{4}+2.5aw_{5}-6(1-a)w_{6}+aw_{7}+4.5(1-a)w_{8}+aw_{9}+2.5(1-a)w_{10}+1.5w_{11}-2.5w_{12}+w_{13}-1.5w_{14}+3w_{15}-2w_{16}+3w_{17}-w_{18}+1.5w_{19}-2w_{20}$ and unit variance.
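Scenario 2 can be sketched in the same way, with the coefficients stored in tables. Again this is an illustrative transcription of the formulas above, not the authors' code; the constant and function names are hypothetical, and the term printed as $w{{}_{1}6}$ in the source is read as $w_{16}$ .

```python
import math
import random

PI_INTERCEPT = 0.2
# Propensity coefficients for w_1, ..., w_20, in order.
PI_COEF = [-0.5, -0.5, -0.5, 0.5, -0.5, 0.5, -0.5, -0.5, -0.5, -0.2,
           0.5, -1.0, 1.0, -1.5, 1.0, -1.0, 2.0, -1.0, 1.5, -1.0]

# Outcome-regression coefficients, keyed by the 1-based covariate index.
MU_TREATED = {1: 3.0, 3: 1.5, 5: 2.5, 7: 1.0, 9: 1.0}      # multiply a * w_j
MU_CONTROL = {2: 6.5, 4: 4.0, 6: -6.0, 8: 4.5, 10: 2.5}     # multiply (1 - a) * w_j
MU_MAIN = {11: 1.5, 12: -2.5, 13: 1.0, 14: -1.5, 15: 3.0,
           16: -2.0, 17: 3.0, 18: -1.0, 19: 1.5, 20: -2.0}  # multiply w_j


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pi0(w):
    """Propensity score of Scenario 2."""
    return expit(PI_INTERCEPT + sum(c * x for c, x in zip(PI_COEF, w)))


def mu0(a, w):
    """Conditional mean of Y given A = a and W = w in Scenario 2."""
    treated = sum(c * w[j - 1] for j, c in MU_TREATED.items())
    control = sum(c * w[j - 1] for j, c in MU_CONTROL.items())
    main = sum(c * w[j - 1] for j, c in MU_MAIN.items())
    return -0.5 + 3.5 * a + a * treated + (1 - a) * control + main


def draw_unit(rng):
    """Draw one data unit (W, A, Y); Y is normal with unit variance."""
    w = [rng.uniform(-1, 1) for _ in range(20)]
    a = int(rng.random() < pi0(w))
    y = rng.gauss(mu0(a, w), 1.0)
    return w, a, y


rng = random.Random(0)
w, a, y = draw_unit(rng)
print(a, mu0(1, w) - mu0(0, w))  # treatment indicator and true CATE at w
```

Because the outcome regression is linear, the true CATE in this scenario is $3.5$ plus the treated-arm terms minus the control-arm terms.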

Coefficients of the propensity score logistic regression models above were selected such that the probabilities of treatment were bounded between 0.05 and 0.95 in the low-dimensional case (Scenario 1), and between 0.01 and 0.99 in the high-dimensional setting (Scenario 2).

D.2 Implementation of the causal isotonic calibrator

In our simulation studies, we followed Algorithm 3 to fit the causal isotonic calibrator. In particular, we estimated the components of $\chi_{0}$ (i.e., $\mu_{0}$ and $\pi_{0}$ ) using the Super Learner (van der Laan et al., 2007) in Scenario 1, and penalized regression in Scenario 2. The Super Learner is an ensemble learning approach that uses cross-validation to select a convex combination of a library of candidate prediction methods. Table 1 shows the library of prediction models we used to estimate $\mu_{0}$ and $\pi_{0}$ . Note that all of our models for the outcome regression were misspecified in Scenario 1 because of the nonlinearities in the true outcome regression. However, in both scenarios, the propensity score estimator was a consistent estimator of the true propensity score. Additionally, for numerical stability, we truncated the estimated propensity scores so that they took values between 0.01 and 0.99. We used the R package sl3 (Coyle et al., 2021) to implement the estimation procedure. Finally, we used the R function isoreg to perform the isotonic regression step.
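The propensity truncation and the construction of the pseudo-outcome fed to the isotonic regression step can be sketched as follows. This is an assumption-laden illustration rather than the authors' R code: the pseudo-outcome is taken to be the standard doubly robust (AIPW) form $\mu(1,w)-\mu(0,w)+\frac{a-\pi(w)}{\pi(w)\{1-\pi(w)\}}\{y-\mu(a,w)\}$ , whose definition is not restated in this appendix, and the helper names `clip` and `pseudo_outcome` are hypothetical.

```python
def clip(p, lower=0.01, upper=0.99):
    """Truncate an estimated propensity score to [lower, upper] for stability."""
    return min(max(p, lower), upper)


def pseudo_outcome(y, a, pi_hat, mu1_hat, mu0_hat):
    """Doubly robust pseudo-outcome (assumed AIPW form) for one observation.

    y: observed outcome; a: treatment indicator (0 or 1);
    pi_hat: estimated propensity score; mu1_hat, mu0_hat: estimated
    outcome regressions under treatment and control.
    """
    p = clip(pi_hat)
    mu_a = mu1_hat if a == 1 else mu0_hat
    return mu1_hat - mu0_hat + (a - p) / (p * (1 - p)) * (y - mu_a)


# A treated unit with outcome 1, propensity 0.5 and regressions 0.7 and 0.4.
print(pseudo_outcome(1, 1, 0.5, 0.7, 0.4))
# An extreme estimated propensity is truncated before it enters the weight.
print(pseudo_outcome(0, 0, 0.999, 0.7, 0.4))
```

Regressing these pseudo-outcomes on the predictor values with a monotone least-squares fit (for example R's isoreg, as in the text) would then give the calibrated predictor.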

D.3 Performance metrics

We estimated the performance metrics as follows. With a slight abuse of notation, let $\hat{\tau}$ denote an arbitrary estimated treatment effect predictor or its calibrated version. For each fitted $\hat{\tau}$ in a given simulation, we computed its mean squared error by taking the empirical mean of the squared difference between the fitted values of the CATE estimator and $\tau_{0}$ ,

[TABLE]

We obtained the estimated calibration measure in two steps. We recall that the calibration measure for a given predictor $\tau$ is

[TABLE]

First, we estimated $\gamma_{0}(\hat{\tau},w)$ using an independent dataset of 100,000 observations and fitted gradient boosted regression trees with the fitted values of the treatment effect predictors as covariates and the true CATE as outcome. For each simulation setting and CATE estimator, the depth of the regression trees was selected using cross-validation in a separate simulation. Let $\hat{\gamma}_{0}(\hat{\tau},w)$ denote the estimated function. In the second step, we used the sample $\mathcal{V}$ to estimate the calibration measure as

[TABLE]

The above measure has the advantage of having less bias with respect to $\text{CAL}(\hat{\tau})$ than the plug-in estimator $n_{\mathcal{V}}^{-1}\sum_{i:w_{i}\in\mathcal{V}}\left[\hat{\gamma}_{0}(\hat{\tau},w_{i})-\hat{\tau}(w_{i})\right]^{2}$ .
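For concreteness, the empirical mean squared error and the plug-in calibration estimator mentioned above can be written in a few lines. This sketch covers only these two quantities, which are spelled out in the text; it does not attempt the bias-reduced estimator, and the function names and toy numbers are hypothetical.

```python
def mse(tau_hat, tau_true):
    """Empirical mean of squared differences between predictions and the true CATE."""
    return sum((h - t) ** 2 for h, t in zip(tau_hat, tau_true)) / len(tau_hat)


def plugin_calibration(gamma_hat, tau_hat):
    """Plug-in calibration estimate: mean of [gamma_hat(tau_hat, w_i) - tau_hat(w_i)]^2."""
    return sum((g - h) ** 2 for g, h in zip(gamma_hat, tau_hat)) / len(tau_hat)


# Toy example: a predictor that doubles the true CATE. It ranks units
# perfectly, so gamma_0(tau_hat, w) equals the true CATE, yet it is miscalibrated.
tau_true = [0.1, 0.2, 0.3, 0.4]
tau_hat = [2 * t for t in tau_true]
gamma = list(tau_true)

print(mse(tau_hat, tau_true), plugin_calibration(gamma, tau_hat))
```

In this toy example the two quantities coincide because the predictor is a one-to-one transformation of the true CATE, so all of its error is calibration error.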

Appendix E Simulation results

Bibliography


S. Athey. Beyond prediction: Using big data for policy problems. Science, 355(6324):483–485, 2017.
S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
S. Athey and S. Wager. Estimating treatment effects with causal forests: An application. Observational Studies, 5(2):37–51, 2019.
R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramírez-Quintana. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 128–146. IGI Global, 2010.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
J. Brooks, M. J. van der Laan, and A. S. Go. Targeted maximum likelihood estimation for prediction calibration. The International Journal of Biostatistics, 8(1), 2012.
N. Carnegie, V. Dorie, and J. L. Hill. Examining treatment effect heterogeneity using BART. Observational Studies, 5(2):52–70, 2019.
