Time-uniform confidence bands for the CDF under nonstationarity

Paul Mineiro; Steven R. Howard

arXiv:2302.14248·stat.ML·March 1, 2023

TL;DR

This paper develops valid, time-uniform confidence bands for the cumulative distribution function (CDF) of a random variable in nonstationary settings, extending to importance-weighted cases for counterfactual analysis.

Contribution

It introduces the first computationally feasible, always-valid confidence bounds on the CDF under nonstationarity, with an instance-dependent convergence guarantee and applicability to importance-weighted data.

Findings

1. Provides time-uniform and value-uniform confidence bands that remain valid under nonstationarity.

2. Extends the bounds to importance-weighted estimation of counterfactual distributions.

3. Gives an instance-dependent convergence guarantee: in arbitrary data-dependent environments the bands are always valid but may be trivial.

Abstract

Estimation of the complete distribution of a random variable is a useful primitive for both manual and automated decision making. This problem has received extensive attention in the i.i.d. setting, but the arbitrary data dependent setting remains largely unaddressed. Consistent with known impossibility results, we present computationally felicitous time-uniform and value-uniform bounds on the CDF of the running averaged conditional distribution of a real-valued random variable which are always valid and sometimes trivial, along with an instance-dependent convergence guarantee. The importance-weighted extension is appropriate for estimating complete counterfactual distributions of rewards given controlled experimentation data exhaust, e.g., from an A/B test or a contextual bandit.

Tables

Table 1: Comparison to prior art for CDF estimation. See Section 5 for details.

Reference	Quantile-uniform?	Time-uniform?	Non-stationary?	Non-asymptotic?	Non-parametric?	Counter-factual?	$w_{\max}$-free?¹
HR22	✓	✓		✓	✓		N/A
HLLA21	✓			✓	✓	✓
UnO21 IID	✓			✓	✓	✓	✓
UnO21 NS	✓		✓			✓	✓
WS22 §4	✓	✓		✓	✓	✓	✓
this paper	✓	✓	✓	✓	✓	✓	✓

¹ $w_{\max}$-free techniques are valid with unbounded importance weights.

Table 2: Comparison of DDRM and Empirical Bernstein on i.i.d. $X_{t}\sim\text{Beta}(6,3)$, for different $W_{t}$. Width denotes the maximum bound width $\sup_{v} U_{t}(v) - L_{t}(v)$. Time is for computing the bound at 1000 equally spaced points.

$W_{t}$	Method	Width	Time (sec)
$\text{Exp}(1)$	DDRM	0.09	24.8
$\text{Exp}(1)$	Emp. Bern	0.10	1.0
$\text{Pareto}(3/2)$	DDRM	0.052	59.4
$\text{Pareto}(3/2)$	Emp. Bern	0.125	2.4
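
As a rough illustration of the Table 2 setup, the sketch below simulates i.i.d. $X_{t}\sim\text{Beta}(6,3)$ with mean-one $\text{Exp}(1)$ importance weights and evaluates the importance-weighted empirical CDF at 1000 equally spaced points. It computes only the point estimate, not the DDRM or Empirical Bernstein bounds; the function name `iw_ecdf_on_grid`, the sample size, and the seed are assumptions made for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def iw_ecdf_on_grid(xs, ws, grid):
    """Importance-weighted empirical CDF, (1/t) * sum_s w_s * 1[x_s <= v]."""
    pairs = sorted(zip(xs, ws))
    sorted_xs = [x for x, _ in pairs]
    # cum_w[k] is the total weight of the k smallest observations.
    cum_w = [0.0] + list(accumulate(w for _, w in pairs))
    t = len(xs)
    return [cum_w[bisect_right(sorted_xs, v)] / t for v in grid]

rng = random.Random(0)                         # arbitrary seed (assumption)
t = 2000                                       # arbitrary sample size (assumption)
xs = [rng.betavariate(6, 3) for _ in range(t)]
ws = [rng.expovariate(1.0) for _ in range(t)]  # Exp(1) weights have mean one
grid = [i / 999 for i in range(1000)]          # 1000 equally spaced points on [0, 1]
estimate = iw_ecdf_on_grid(xs, ws, grid)
```

Because the weights are not normalized, the estimate at the right end of the grid equals the average weight and can differ from 1; sorting once and using prefix sums keeps the evaluation at all grid points cheap.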

Equations

\overline{\mathrm{CDF}}_{t}(v) \doteq \frac{1}{t} \sum_{s \leq t} \mathbb{E}_{s-1}\left[1_{X_{s} \leq v}\right],

\mathbb{P}\left(\forall t \in \mathbb{N}, \forall v \in \mathbb{R} : L_{t}(v) \leq \overline{\mathrm{CDF}}_{t}(v) \leq U_{t}(v)\right) \geq 1 - 2\alpha.
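
To make the estimand concrete, here is a minimal sketch of the running averaged conditional CDF for a toy nonstationary process. It assumes, purely for illustration, that $X_{s}$ is uniform on $[0, \theta_{s}]$ given the past, with a deterministically drifting $\theta_{s}$; this process and the names `conditional_cdf`, `averaged_cdf`, and `thetas` are hypothetical and not taken from the paper.

```python
from itertools import accumulate

def conditional_cdf(v, theta):
    """P(X_s <= v | past) for X_s ~ Uniform(0, theta) (assumed toy model)."""
    return min(max(v / theta, 0.0), 1.0)

def averaged_cdf(v, thetas):
    """Return [(1/t) * sum_{s<=t} P(X_s <= v | past) for t = 1..T]."""
    running = accumulate(conditional_cdf(v, th) for th in thetas)
    return [total / t for t, total in enumerate(running, start=1)]

# Hypothetical drift: the support widens from [0, 0.5] to [0, 1].
thetas = [0.5, 0.625, 0.75, 0.875, 1.0]
avg = averaged_cdf(0.5, thetas)
```

In this toy model the target at $v = 0.5$ falls as the support widens, so there is no single fixed CDF to estimate; the bands $L_{t}$ and $U_{t}$ are required to cover this moving average at every $t$ simultaneously.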

E_{t}(\lambda) \doteq \exp\left(\lambda S_{t} - \sum_{s \leq t} \log h(\lambda, \theta_{s})\right),

\mathbb{P}\left(\forall t : \overline{\mathrm{CDF}}_{t}(v) \geq \Lambda_{t}(\rho; \delta, \Psi_{t})\right)

\mathbb{P}\left(\forall t : \overline{\mathrm{CDF}}_{t}(v) \leq \Xi_{t}(\rho; \delta, \Psi_{t})\right)

\forall t, \forall v : U_{t}(v) - L_{t}(v) \leq \frac{V_{t}}{t} + \tilde{O}\left(\frac{V_{t}}{t} \log\left(\xi_{t}^{-2} \alpha^{-1} t^{3/2}\right)\right),

\forall t, \forall v : U_{t}(v) - L_{t}(v) \leq B_{t} + \frac{(\tau + V_{t})/t}{t} + \tilde{O}\left(\frac{(\tau + V_{t})/t}{t} \log\left(\xi_{t}^{-2} \alpha^{-1}\right)\right) + \tilde{O}\left(t^{-1} \log\left(\xi_{t}^{-2} \alpha^{-1}\right)\right),

\mathbb{P}\left(\forall x \in \mathbb{R} : \dot{L}_{n}(x) \leq F(x) \leq \dot{U}_{n}(x)\right) \geq 1 - \alpha,

\mathbb{P}\left(\forall x \in \mathbb{R}, t \in \mathbb{N} : \overline{L}_{t}(x) \leq F(x) \leq \overline{U}_{t}(x)\right) \geq 1 - \alpha.
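
For comparison, the fixed-sample-size i.i.d. statement above is the kind of guarantee given by the classical Dvoretzky-Kiefer-Wolfowitz inequality. The sketch below builds that band using Massart's constant, $\epsilon = \sqrt{\log(2/\alpha)/(2n)}$. It is valid only for i.i.d. data at one fixed $n$ and is not the construction of this paper; the function name `dkw_band` and the sample values are illustrative assumptions.

```python
import math
from bisect import bisect_right

def dkw_band(sample, alpha):
    """Return (lower, upper, eps) for a 1 - alpha DKW band around the ECDF.

    Valid for i.i.d. data at a single fixed sample size only.
    """
    xs = sorted(sample)
    n = len(xs)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

    def ecdf(x):
        return bisect_right(xs, x) / n

    def lower(x):
        return max(ecdf(x) - eps, 0.0)  # clip to the valid range [0, 1]

    def upper(x):
        return min(ecdf(x) + eps, 1.0)

    return lower, upper, eps

sample = [0.1, 0.4, 0.4, 0.7, 0.9]      # made-up data for illustration
lower, upper, eps = dkw_band(sample, alpha=0.05)
```

With only five points the half-width is about 0.61, so the band is nearly trivial; the width shrinks at rate $n^{-1/2}$, but the guarantee does not hold uniformly over time or under data dependence.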

E_{t}(\lambda) \doteq \exp\left(\lambda S_{t} - \sum_{s \leq t} \log h(\lambda, \theta_{s})\right),

E_{t}(\lambda) \geq \exp\left(\lambda t \left(q_{t} - \hat{q}_{t}\right) - t\, h(\lambda, q_{t})\right),

\mathbb{P}\left(\exists t : \overline{\mathrm{CDF}}_{t}(v) > \Xi_{t}(i / \epsilon(d); \delta_{d}, d, \Psi_{t})\right)

\mathbb{P}\left(\exists d \in \mathbb{N}, i \in \{1, \dots, d\}, t \in \mathbb{N} : \overline{\mathrm{CDF}}_{t}(i / \epsilon(d)) > \Xi_{t}(i / \epsilon(d); \delta, d, \Psi_{t})\right) \leq \alpha.

\mathbb{P}\left(\forall t, d : \overline{\mathrm{CDF}}_{t}(v) \leq \overline{\mathrm{CDF}}_{t}(\rho_{d}) \leq \Xi_{t}(\rho_{d}; \delta_{d}, d, \Psi_{t})\right) \geq 1 - \alpha.

M(t; q_{t}, \tau)

\frac{\alpha}{\eta^{2d}}

\overline{\mathrm{CDF}}_{t}\left(\epsilon(d) \lceil \epsilon(d)^{-1} v \rceil\right) - \overline{\mathrm{CDF}}_{t}(v) \leq 1 / (\xi_{t} \eta^{d}),

r_{d}(t)

|k_{d}| - 1

\xi_{t} / \psi_{t}

\implies 1 + |k_{d}|

\implies r_{d}(t)

E_{t}(\lambda) \geq \exp\left(\lambda \left(\min\left(t, \sum_{s \leq t} Y_{s}\right) - \mathbb{E}_{s-1}[Y_{s}]\right) + \sum_{s \leq t} \log\left(1 + \lambda (Y_{s} - \overline{Y_{t}})\right) - \mathrm{Reg}(t)\right),

M_{t}^{\mathrm{EB}} \doteq \left(\frac{\tau^{\tau} e^{-\tau}}{\Gamma(\tau) - \Gamma(\tau, \tau)}\right) \left(\frac{1}{\tau + V_{t}}\right) {}_{1}F_{1}\left(1, V_{t} + \tau + 1, S_{t} + V_{t} + \tau\right),

\frac{\alpha}{\eta^{2d}}

u\left(V_{t}; \tau, \frac{\alpha}{\eta^{2d}}\right)

+ \frac{1}{t} \log\left(\frac{\tau + V_{t}}{2\pi} e^{-\frac{1}{12(\tau + V_{t}) + 1}} \left(\frac{1 + \eta^{2d} \alpha^{-1}}{C(\tau)}\right)\right),

\overline{\mathrm{CDF}}_{t}\left(\epsilon(d) \lceil \epsilon(d)^{-1} v \rceil\right) - \overline{\mathrm{CDF}}_{t}(v) \leq 1 / (\xi_{t} \eta^{d}),

r_{d}(t)

= \frac{(\tau + V_{t})/t}{t} + \tilde{O}\left(\frac{(\tau + V_{t})/t}{t} \log\left(\xi_{t}^{-2} \alpha^{-1}\right)\right) + \tilde{O}\left(t^{-1} \log\left(\xi_{t}^{-2} \alpha^{-1}\right)\right),

\exp\left(\lambda S_{t} - \psi_{e}(\lambda) V_{t}\right),

\psi_{e}(\lambda)

Code & Models

Repositories

microsoft/csrobust

Videos

Time-uniform confidence bands for the CDF under nonstationarity · SlidesLive

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Advanced Bandit Algorithms Research · Machine Learning and Algorithms

Full text

Paul Mineiro

Microsoft Research

Steve Howard

1 Introduction

What would have happened if I had acted differently? Although this question is as old as time itself, successful companies have recently embraced it via counterfactual estimation of outcomes from the exhaust of their controlled experimentation platforms, e.g., based upon A/B testing or contextual bandits. These experiments are run in the real (digital) world, which is rich enough to demand statistical techniques that are non-asymptotic, non-parametric, and non-stationary. Although recent advances admit characterizing counterfactual average outcomes in this general setting, counterfactually estimating a complete distribution of outcomes is heretofore only possible with additional assumptions. Nonetheless, the practical importance of this problem has motivated multiple solutions: see Table 1 for a summary, and Section 5 for complete discussion.

Intriguingly, this problem is provably impossible in the data dependent setting without additional assumptions (Rakhlin et al., 2015). Consequently, our bounds always achieve non-asymptotic coverage, but may converge to zero width slowly or not at all, depending on the hardness of the instance. We call this design principle AVAST (Always Valid And Sometimes Trivial).

In pursuit of our ultimate goal, we derive factual distribution estimators which are useful for estimating the complete distribution of outcomes from direct experience.

Contributions

1. In Section 3.1 we provide a time and value uniform upper bound on the CDF of the averaged historical conditional distribution of a discrete-time real-valued random process. Consistent with the lack of sequential uniform convergence of linear threshold functions (Rakhlin et al., 2015), the bounds are always valid and sometimes trivial, but with an instance-dependent guarantee: when the data generating process is smooth qua Block et al. (2022) with respect to the uniform distribution on the unit interval, the bound width adapts to the unknown smoothness parameter.

2. In Section 3.2 we extend the previous technique to distributions with support over the entire real line, and further to distributions with a known countably infinite or unknown nowhere dense set of discrete jumps, with analogous instance-dependent guarantees.

3. In Section 3.3 we extend the previous techniques to importance-weighted random variables, achieving our ultimate goal of estimating a complete counterfactual distribution of outcomes.

We exhibit our techniques in various simulations in Section 4. Computationally our procedures have comparable cost to point estimation of the empirical CDF, as the empirical CDF is a sufficient statistic.
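
Since the empirical CDF is described as a sufficient statistic, one plausible way to keep it available as data arrive is a sorted buffer. The sketch below is an illustrative helper under that assumption, not code from the paper or its repository; the class name `StreamingECDF` is hypothetical.

```python
from bisect import bisect_right, insort

class StreamingECDF:
    """Maintains the empirical CDF of a stream (illustrative helper)."""

    def __init__(self):
        self._sorted = []

    def update(self, x):
        # Binary search finds the slot; the list insertion itself is O(t).
        insort(self._sorted, x)

    def __call__(self, v):
        if not self._sorted:
            raise ValueError("no observations yet")
        # Fraction of observations that are <= v.
        return bisect_right(self._sorted, v) / len(self._sorted)

ecdf = StreamingECDF()
for x in [0.3, 0.1, 0.2]:
    ecdf.update(x)
```

Each query is a binary search, so evaluating the empirical CDF on a fixed grid costs a logarithmic factor per grid point; a balanced tree or similar structure would lower the insertion cost for long streams.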

Bibliography

1. Block et al. [2022] Adam Block, Yuval Dagan, Noah Golowich, and Alexander Rakhlin. Smoothed online learning is as easy as statistical learning. arXiv preprint arXiv:2202.04690, 2022.
2. Cantelli [1933] Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4:421–424, 1933.
3. Chandak et al. [2021] Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S Thomas. Universal off-policy evaluation. Advances in Neural Information Processing Systems, 34:27475–27490, 2021.
4. Chatzigeorgiou [2013] Ioannis Chatzigeorgiou. Bounds on the Lambert function and their application to the outage analysis of user cooperation. IEEE Communications Letters, 17(8):1505–1508, 2013.
5. Dvoretzky et al. [1956] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
6. Fan et al. [2015] Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20:1–22, 2015.
7. Feller [1958] William Feller. An introduction to probability theory and its applications, 3rd edition. Wiley series in probability and mathematical statistics, 1958.
8. Glivenko [1933] Valery Glivenko. Sulla determinazione empirica delle leggi di probabilita. Giorn. Ist. Ital. Attuari, 4:92–99, 1933.