Sharp Calibrated Gaussian Processes

Alexandre Capone; Geoff Pleiss; Sandra Hirche

arXiv:2302.11961·cs.LG·November 20, 2023

Sharp Calibrated Gaussian Processes

Alexandre Capone, Geoff Pleiss, Sandra Hirche

PDF

Open Access

TL;DR

This paper introduces a flexible calibration method for Gaussian processes that improves the accuracy of predictive quantiles, ensuring better frequentist calibration and sharper confidence intervals in regression tasks.

Contribution

A novel calibration approach for Gaussian processes that uses hyperparameters optimized for empirical calibration, enhancing the sharpness and reliability of predictive intervals.

Findings

01

Outperforms existing calibration methods in sharpness.

02

Yields well-calibrated predictive quantiles.

03

Applicable under reasonable assumptions.

Abstract

While Gaussian processes are a mainstay for various engineering and scientific applications, the uncertainty estimates don't satisfy frequentist guarantees and can be miscalibrated in practice. State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse. To remedy this, we present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but using a different set of hyperparameters chosen to satisfy an empirical calibration constraint. This results in a calibration approach that is considerably more flexible than existing approaches, which we optimize to yield tight predictive quantiles. Our approach is shown to yield a calibrated model under reasonable assumptions. Furthermore, it…

Tables2

Table 1. Table 1 : Expected calibration error and sharpness of different methods over 100 100 100 repetitions per experiment. We report the expected calibration error (ECE), the average predictive standard deviation (STD), negative log-likelihood (NLL) and 95 % percent \% confidence interval width (95 % percent \% CI) obtained with our approach, vanilla GPs recalibrated using the methods of Kuleshov et al. ( 2018 ) (RK) and Vovk et al. ( 2020 ) (RV), the variance-free approach proposed in Marx et al. ( 2022 ) (RM), and the neural network-based recalibrator of Kuleshov & Deshpande ( 2022 ) (NN). We additionally report performance for the base model (B), which corresponds to a vanilla GP without the holdout data. Lower is better for all metrics. In all experiments except the Facebook2 dataset, our method is sharpest compared to all other methods except that of Kuleshov & Deshpande ( 2022 ) . However, Kuleshov & Deshpande ( 2022 ) performs more poorly in terms of expected calibration error.

Data set	Metric	Ours	RK	RV	RM	NN	B
	ECE	0.003	0.0029	0.0029	0.0029	0.0056	0.041
	STD	0.16	0.31	0.3	0.33	0.22	1.9
Boston	NLL	0.21	0.39	0.4	0.42	-0.24	1.6
	95 $%$ CI	0.76	1.4	1.4	1.4	0.73	7.4
	ECE	0.0044	0.0043	0.0044	0.0043	0.0081	0.039
	STD	0.16	0.5	0.47	0.5	0.14	2.8
yacht	NLL	0.26	0.68	0.69	0.68	-2	2
	95 $%$ CI	0.76	2.3	2.3	2.3	0.3	11
	ECE	0.0036	0.0035	0.0035	0.0035	0.0053	0.044
	STD	0.13	0.38	0.38	0.38	0.29	2.8
mpg	NLL	0.032	0.63	0.64	0.63	0.02	2
	95 $%$ CI	0.6	1.7	1.8	1.7	0.96	11
	ECE	0.00047	0.00047	0.00047	0.00047	0.0067	0.0058
	STD	0.54	1	1	0.88	0.72	1.4
wine	NLL	1.2	1.3	1.3	1.3	-0.36	1.4
	95 $%$ CI	2.1	3.8	3.8	3.9	2.8	5.4
	ECE	0.00071	0.00064	0.00064	0.00064	0.00076	0.032
	STD	0.25	0.64	0.64	0.64	0.57	2.8
concrete	NLL	0.72	1.1	1.1	1.1	0.85	2
	95 $%$ CI	0.93	2.5	2.5	2.5	2.1	11
	ECE	0.00016	0.00016	0.00016	0.00016	0.00053	0.028
	STD	0.074	0.12	0.12	0.12	0.098	0.4
kin8nm	NLL	-0.54	-0.65	-0.65	-0.63	-0.76	0.1
	95 $%$ CI	0.26	0.47	0.47	0.48	0.44	1.6
	ECE	0.00044	0.00043	0.00043	0.00045	0.0089	0.044
	STD	0.068	0.18	0.18	0.18	0.18	1.2
Facebook2	NLL	3.6	-1.3	-1.3	-1.2	-2.3	1.2
	95 $%$ CI	0.6	1.7	1.7	1.7	3.4	4.6

Table 2. Table 2 : Rate of uniform error bound violation (RUEBV) and 100 % percent \% confidence interval width obtained with our approach and that of Capone et al. ( 2022 ) . Lower is better for all metrics.

Data set	Metric	Ours	Capone et al. (2022)
	Rate of uniform error bound violation	0.00376	0.000172
Boston	Length of 100 $%$ CI	1.2	28.3
	Rate of uniform error bound violation	0.004	0.0065
mpg	Length of 100 $%$ CI	1.7	24.13
	Rate of uniform error bound violation	0.00072	0.00096
wine	Length of 100 $%$ CI	4.7	24.8

Equations74

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}y-\mu_{\mathcal{D}_{\text{tr}}}(\bm{x})\leq\beta_{\delta}\sigma_{\mathcal{D}}(\delta,\bm{x})\Big{)}=\delta

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}y-\mu_{\mathcal{D}_{\text{tr}}}(\bm{x})\leq\beta_{\delta}\sigma_{\mathcal{D}}(\delta,\bm{x})\Big{)}=\delta

∣ β_{δ} σ_{D_{tr}} (δ, x) - β_{1 - δ} σ_{D_{tr}} (1 - δ, x) ∣

∣ β_{δ} σ_{D_{tr}} (δ, x) - β_{1 - δ} σ_{D_{tr}} (1 - δ, x) ∣

[K (θ)]_{ij} = k (θ, x_{i}, x_{j})

[K (θ)]_{ij} = k (θ, x_{i}, x_{j})

μ_{D_{tr}} (θ, x^{*}) =

μ_{D_{tr}} (θ, x^{*}) =

σ_{D_{tr}}^{2} (θ, x^{*}) = k (θ, x^{*}, x^{*}) - k (θ) (K (θ) + σ_{0}^{2} I)^{- 1} k (θ) + σ_{0}^{2},

lo g p (y ∣ X, θ) = - \frac{1}{2} lo g ∣ K (θ) + σ_{0}^{2} I ∣ - \frac{N}{2} lo g (2 π) - \frac{1}{2} y^{⊺} (K (θ) + σ_{0}^{2} I)^{- 1} y .

lo g p (y ∣ X, θ) = - \frac{1}{2} lo g ∣ K (θ) + σ_{0}^{2} I ∣ - \frac{N}{2} lo g (2 π) - \frac{1}{2} y^{⊺} (K (θ) + σ_{0}^{2} I)^{- 1} y .

β_{δ} \in R θ_{δ} \in Θ min s.t. i = 1 \sum N_{cal} β_{δ}^{2} σ_{D_{tr}}^{2} (θ_{δ}, x_{cal}^{i}) i = 1 \sum N_{cal} \frac{I _{\geq 0} ( Δ y _{cal}^{i} - β _{δ} σ _{D_{tr}} ( θ _{δ} , x _{cal}^{i} ) )}{N _{cal} + 1} = δ

β_{δ} \in R θ_{δ} \in Θ min s.t. i = 1 \sum N_{cal} β_{δ}^{2} σ_{D_{tr}}^{2} (θ_{δ}, x_{cal}^{i}) i = 1 \sum N_{cal} \frac{I _{\geq 0} ( Δ y _{cal}^{i} - β _{δ} σ _{D_{tr}} ( θ _{δ} , x _{cal}^{i} ) )}{N _{cal} + 1} = δ

β_{δ} = q_{lin} (δ, Σ_{D_{tr}}^{- 1} Δ y_{cal}),

β_{δ} = q_{lin} (δ, Σ_{D_{tr}}^{- 1} Δ y_{cal}),

q_{lin} (δ, a) = a_{i_{l}} + (δ (N + 1) - l) (a_{i_{l + 1}} - a_{i_{l}}) .

q_{lin} (δ, a) = a_{i_{l}} + (δ (N + 1) - l) (a_{i_{l + 1}} - a_{i_{l}}) .

Σ_{D_{tr}}^{- 1} Δ y_{cal} = (\frac{Δ y _{cal}^{1}}{σ _{D_{tr}} ( θ _{δ} , x _{cal}^{1} )}, ..., \frac{Δ y _{cal}^{N_{cal}}}{σ _{D_{tr}} ( θ _{δ} , x _{cal}^{N_{cal}} )})^{⊤}

Σ_{D_{tr}}^{- 1} Δ y_{cal} = (\frac{Δ y _{cal}^{1}}{σ _{D_{tr}} ( θ _{δ} , x _{cal}^{1} )}, ..., \frac{Δ y _{cal}^{N_{cal}}}{σ _{D_{tr}} ( θ _{δ} , x _{cal}^{N_{cal}} )})^{⊤}

θ \in Θ min i = 1 \sum N_{cal} [q_{lin} (δ, Σ_{D_{tr}}^{- 1} Δ y_{cal}) σ_{D_{tr}} (θ_{δ}, x_{cal}^{i})]^{2},

θ \in Θ min i = 1 \sum N_{cal} [q_{lin} (δ, Σ_{D_{tr}}^{- 1} Δ y_{cal}) σ_{D_{tr}} (θ_{δ}, x_{cal}^{i})]^{2},

β_{δ_{i}} \leq β_{δ_{j}} if δ_{i} < δ_{j},

β_{δ_{i}} \leq β_{δ_{j}} if δ_{i} < δ_{j},

θ_{δ_{i}} \leq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{i}} \geq 0,

θ_{δ_{i}} \leq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{i}} \geq 0,

θ_{δ_{i}} \geq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{j}} \leq 0.

θ_{δ_{i}} \geq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{j}} \leq 0.

θ_{δ_{i}} \leq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{i}} \geq 0,

θ_{δ_{i}} \leq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{i}} \geq 0,

θ_{δ_{i}} \geq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{j}} \leq 0.

θ_{δ_{i}} \geq θ_{δ_{j}}, if δ_{i} < δ_{j} and β_{δ_{j}} \leq 0.

δ_{1} = \frac{1}{N _{cal} + 1}, δ_{2} = \frac{2}{N _{cal} + 1}, \dots, δ_{N_{cal}} = \frac{N _{cal}}{N _{cal} + 1},

δ_{1} = \frac{1}{N _{cal} + 1}, δ_{2} = \frac{2}{N _{cal} + 1}, \dots, δ_{N_{cal}} = \frac{N _{cal}}{N _{cal} + 1},

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}y-\mu_{\mathcal{D}_{\text{tr}}}(\hat{\bm{\theta}}(\delta),\bm{x})\leq\hat{\beta}(\delta)\sigma_{\mathcal{D}_{\text{tr}}}(\hat{\bm{\theta}}(\delta),\bm{x})\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right].

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}y-\mu_{\mathcal{D}_{\text{tr}}}(\hat{\bm{\theta}}(\delta),\bm{x})\leq\hat{\beta}(\delta)\sigma_{\mathcal{D}_{\text{tr}}}(\hat{\bm{\theta}}(\delta),\bm{x})\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right].

cal (μ_{D_{tr}} (θ^{R}, \cdot), β_{\cdot}, σ_{D_{tr}} (\cdot, \cdot)) = j = 1 \sum m (p_{j} - \overset{p}{^}_{j})^{2},

cal (μ_{D_{tr}} (θ^{R}, \cdot), β_{\cdot}, σ_{D_{tr}} (\cdot, \cdot)) = j = 1 \sum m (p_{j} - \overset{p}{^}_{j})^{2},

\displaystyle\begin{split}\hat{p}_{j}=\frac{\left|\left\{y_{t}^{*}\ \big{|}\ \Delta y_{t}^{*}\leq\beta_{p_{j}}\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}_{p_{j}}\bm{x}_{t}^{*}),t=1,...,T\right\}\right|}{T}.\end{split}

\displaystyle\begin{split}\hat{p}_{j}=\frac{\left|\left\{y_{t}^{*}\ \big{|}\ \Delta y_{t}^{*}\leq\beta_{p_{j}}\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}_{p_{j}}\bm{x}_{t}^{*}),t=1,...,T\right\}\right|}{T}.\end{split}

φ (x_{cal}^{i_{j}}, y_{cal}^{i_{j}}) < φ (x_{cal}^{i_{j + 1}}, y_{cal}^{i_{j + 1}}),

φ (x_{cal}^{i_{j}}, y_{cal}^{i_{j}}) < φ (x_{cal}^{i_{j + 1}}, y_{cal}^{i_{j + 1}}),

\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}H(\varphi(\bm{x},y))\leq\delta\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right]\quad\forall\ \delta\in[0,1].

\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}H(\varphi(\bm{x},y))\leq\delta\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right]\quad\forall\ \delta\in[0,1].

y - μ_{D_{tr}} (θ^{R}, x) - \hat{β} (φ (x, y)) σ (\hat{θ} (φ (x, y)), x) = 0

y - μ_{D_{tr}} (θ^{R}, x) - \hat{β} (φ (x, y)) σ (\hat{θ} (φ (x, y)), x) = 0

δ \to \infty lim \tilde{β} (δ) = \infty, δ \to - \infty lim \tilde{β} (δ) = - \infty, δ \to \infty lim \tilde{θ} (δ) = \infty, δ \to - \infty lim \tilde{θ} (δ) = \infty,

δ \to \infty lim \tilde{β} (δ) = \infty, δ \to - \infty lim \tilde{β} (δ) = - \infty, δ \to \infty lim \tilde{θ} (δ) = \infty, δ \to - \infty lim \tilde{θ} (δ) = \infty,

\tilde{β} (δ) is strictly monotonically increasing for all δ \in R \tilde{θ} (δ) is monotonically increasing for all δ \in {δ \in R ∣ \tilde{β} (δ) > 0} \tilde{θ} (δ) is monotonically decreasing for all δ \in {δ \in R ∣ \tilde{β} (δ) < 0} .

\tilde{β} (δ) is strictly monotonically increasing for all δ \in R \tilde{θ} (δ) is monotonically increasing for all δ \in {δ \in R ∣ \tilde{β} (δ) > 0} \tilde{θ} (δ) is monotonically decreasing for all δ \in {δ \in R ∣ \tilde{β} (δ) < 0} .

y - μ_{D_{tr}} (θ^{R}, x) - \tilde{β} (δ) σ (\tilde{θ} (δ), x)

y - μ_{D_{tr}} (θ^{R}, x) - \tilde{β} (δ) σ (\tilde{θ} (δ), x)

y - μ_{D_{tr}} (θ^{R}, x) - \tilde{β} (φ (x, y)) σ (\tilde{θ} (φ (x, y)), x) = 0.

y - μ_{D_{tr}} (θ^{R}, x) - \tilde{β} (φ (x, y)) σ (\tilde{θ} (φ (x, y)), x) = 0.

y_{cal}^{i} - μ_{D_{tr}} (θ^{R}, x_{cal}^{i}) \neq = y_{cal}^{j} - μ_{D_{tr}} (θ^{R}, x_{cal}^{j})

y_{cal}^{i} - μ_{D_{tr}} (θ^{R}, x_{cal}^{i}) \neq = y_{cal}^{j} - μ_{D_{tr}} (θ^{R}, x_{cal}^{j})

H (φ (x_{cal}^{i_{j}}, y_{cal}^{i_{j}})) = \frac{j}{N _{cal} + 1},

H (φ (x_{cal}^{i_{j}}, y_{cal}^{i_{j}})) = \frac{j}{N _{cal} + 1},

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}H\left(\varphi(\bm{x},y)\right)\leq\delta\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right]\quad\forall\ \delta\in[0,1].

\displaystyle\mathbb{P}_{\bm{x},y\sim\Pi}\Big{(}H\left(\varphi(\bm{x},y)\right)\leq\delta\Big{)}\in\left[\delta-\frac{1}{N_{\text{cal}}+1},\delta+\frac{1}{N_{\text{cal}}+1}\right]\quad\forall\ \delta\in[0,1].

y - μ_{D_{tr}} (θ^{R}, x) - \hat{β} (\overset{φ}{^} (x, y)) σ (\hat{θ} (\overset{φ}{^} (x, y)), x) = 0.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGaussian Processes and Bayesian Inference · Fault Detection and Control Systems · Advanced Control Systems Optimization

MethodsGaussian Process

Full text

Sharp Calibrated Gaussian Processes

Alexandre Capone

Technical University of Munich

[email protected]

&Sandra Hirche

Technical University of Munich

[email protected]

Geoff Pleiss

University of British Columbia

Vector Institute

[email protected]

Abstract

While Gaussian processes are a mainstay for various engineering and scientific applications, the uncertainty estimates don’t satisfy frequentist guarantees and can be miscalibrated in practice. State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse. To remedy this, we present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but using a different set of hyperparameters chosen to satisfy an empirical calibration constraint. This results in a calibration approach that is considerably more flexible than existing approaches, which we optimize to yield tight predictive quantiles. Our approach is shown to yield a calibrated model under reasonable assumptions. Furthermore, it outperforms existing approaches in sharpness when employed for calibrated regression.

1 Introduction

Gaussian process (GP) regression offers an ambitious proposition: by conditioning a model on measurement data, we are provided with a Gaussian probability distribution for the unseen data. Assuming that the posterior probability distribution holds, we can then directly calibrate our model using the inverse error function. Though the distribution of unseen data seldom follows the Gaussian prior distribution, and the GP generally does not adapt adequately to the observed distributions after being conditioned on the data, GPs have become one of the most powerful and established regression techniques. Besides having found widespread use in machine learning (Deisenroth et al., 2015; Srinivas et al., 2012), their good generalization properties have motivated applications in the fields of control (Kocijan, 2016), astrophysics (Roberts et al., 2013) and chemistry (Deringer et al., 2021), to name a few. Furthermore, the Bayesian paradigm offers a powerful tool to analyze the theoretical properties of different regression techniques (Srinivas et al., 2012; Capone et al., 2022).

In this paper, we present a novel approach to obtaining sharp calibrated Gaussian processes, i.e., Gaussian processes that provide concentrated predictive distributions that accurately match the observed data. Instead of computing confidence intervals by inflating the Gaussian process posterior variance, our approach discards it and computes a new quantity inspired by the computation of the posterior variance, where all hyperparameters are chosen in a way that results in both accurate and sharp calibration. In other words, we train two separate Gaussian processes: one for the predictive mean and one for obtaining predictive quantiles, which is exclusively used for calibration purposes. By doing so, we reach considerably more flexibility than existing calibration approaches, which enables us to additionally optimize the sharpness of the calibration. Our approach outperforms several state-of-the-art calibration approaches in terms of sharpness while still yielding similar calibration performance. Furthermore, it is competitive compared to a neural network-based method in sharpness without sacrificing calibration performance.

Notation.

We use $\mathbb{R}_{+}$ to denote the non-negative real numbers. Boldface lowercase/uppercase characters denote vectors/matrices. For two vectors $\bm{a}$ and $\bm{a}^{\prime}$ in $\mathbb{R}^{d}$ , we employ the notation $\bm{a}\leq\bm{a}^{\prime}$ to denote componentwise inequality, i.e., ${a}_{i}\leq{a}^{\prime}_{i}$ , $i=1,\ldots,d$ . For a square matrix $\bm{K}$ , we use $|\bm{K}|$ to denote its determinant, and $[\bm{K}]_{ij}$ to denote the entry corresponding to the $i$ -th row and $j$ -th column.

2 Related Work

Calibration of Classification Models.

There has been extensive work on obtaining calibrated models in the domain of classification. While there are many methods that do not employ post-processing, we only focus here on methods that employ some form of post-processing. Most forms of post-processing-based calibration for classification fall into the category of conformal methods (Vovk et al., 2005), which, given an input, aim to produce sets of labels that contain the true label with a pre-specified probability. Arguably the two most common forms of calibration are isotonic regression (Niculescu-Mizil & Caruana, 2005) and Platt scaling (Platt et al., 1999). In Niculescu-Mizil & Caruana (2005), Platt scaling and isotonic regression are analyzed extensively for different types of predictive models. In Guo et al. (2017), a modified form of Platt scaling for modern classification neural networks is proposed.

Calibration of Regression Models.

Though initially developed for classification, conformal calibration has been extended to regression settings. In Lakshminarayanan et al. (2017), a calibration approach was proposed for deep ensembles. Gal et al. (2017) propose a dropout-based technique for calibrating deep neural networks. However, these approaches require changing the regressor, potentially deteriorating its predictive performance. It should be noted that Bayesian neural networks (MacKay, 1995), while being able to provide credible sets for the output, fully trust the posterior, resulting in a naively calibrated model that seldom reflects the data’s distribution. As a remedy for this, Kuleshov et al. (2018) present a recalibration approach that scales a model’s predictive quantiles to satisfy the observed data’s distribution. In Vovk et al. (2020), a similar approach is presented, where interpolation between scaling factors is randomized, and a theoretical analysis is provided. An extension of both Kuleshov et al. (2018) and Vovk et al. (2020) and other recalibration is proposed in Marx et al. (2022), along with corresponding theoretical guarantees. While these methods have been shown to yield well-calibrated models, the resulting predictive quantiles are potentially much too crude, resulting in predictions that perform poorly in terms of sharpness, i.e., the corresponding confidence intervals will overestimate the model error by a very large margin. To remedy this, Song et al. (2019) and Kuleshov & Deshpande (2022) propose optimizing the parameters of a recalibration model by obtaining calibration on a distribution level. However, while Song et al. (2019) relies on complex approximations and provides no theoretical guarantees, Kuleshov & Deshpande (2022) does not allow to optimize for sharpness directly, and calibration is only guaranteed asymptotically as the number of data grows.

3 Problem Statement

Consider a compact input space $\mathcal{X}\subset\mathbb{R}^{d}$ , and output space $\mathcal{Y}\subset\mathbb{R}$ , and an unknown data distribution $\Pi$ on $\mathcal{X}\times\mathcal{Y}$ . Consider a model conditioned on training data $\mathcal{D}_{\text{tr}}\sim\Pi$ that, for every $\bm{x}\in\mathcal{X}$ and confidence level $\delta\in[0,1]$ , returns a base prediction $\mu_{\mathcal{D}_{\text{tr}}}(\bm{x})$ and an additive cut-point term $\beta_{\delta}\sigma_{\mathcal{D}_{\text{tr}}}(\delta,\bm{x})$ , where $\sigma_{\mathcal{D}_{\text{tr}}}(\delta,\bm{x})\geq 0$ and $\beta_{\delta}\in\mathbb{R}$ is potentially negative, such that $\mu_{\mathcal{D}_{\text{tr}}}(\bm{x})+\beta_{\delta}\sigma_{\mathcal{D}_{\text{tr}}}(\delta,\bm{x})$ corresponds to a predictive $\delta$ -quantile. The model is then said to be calibrated if

[TABLE]

holds for every $\delta\in[0,1]$ . Furthermore, the calibrated model is also said to be sharp if the corresponding predictive distributions are concentrated (Gneiting et al., 2007), i.e., if the centered confidence intervals induced by the predictive quantiles

[TABLE]

are as small as possible for every $\delta\in[0,1]$ . Our goal is to find a sharply calibrated model based on GP regression.

4 Gaussian Process Regression

In this section, we briefly review GP regression, with particular focus on the choice and influence of hyperparameters.

A GP is formally defined as a collection of random variables, any finite subset of which is jointly Gaussian (Rasmussen & Williams, 2006). It is fully specified by a prior mean function, which we set to zero without loss of generality, and a hyperparameter-dependent covariance function, called kernel $k:\bm{\varTheta}\times\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ , where $\bm{\varTheta}$ denotes the hyperparameter space. The core concept behind GP regression lies in assuming that any finite number of measurements of an unknown function $f:\mathcal{X}\rightarrow\mathbb{R}$ at arbitrary inputs $\bm{x}_{1},\ldots,\bm{x}_{N}\in\mathcal{X}$ are jointly Gaussian with mean zero and covariance $\bm{K}(\bm{\theta})$ , where

[TABLE]

consists of kernel evaluations at the test inputs.

Given a set of noisy observations $\mathcal{D}=\{\bm{x}_{i},y_{i}\}_{i=1}^{N}$ , where $y_{i}\coloneqq f(\bm{x}_{i})+\varepsilon_{i}$ and $\varepsilon_{i}\sim\mathcal{N}(0,\sigma_{0}^{2})$ is iid Gaussian measurement noise, we can condition the GP on them to obtain the posterior distribution. The posterior distribution for a new observation $y^{*}$ at an arbitrary input $\bm{x}^{*}$ is again Gaussian distributed, with mean and variance

[TABLE]

with $\bm{k}(\bm{\theta})\coloneqq\left({k}(\bm{\theta},\bm{x}^{*},\bm{x}_{1}),\ldots,{k}(\bm{\theta},\bm{x}^{*},\bm{x}_{N})\right)$ , $\bm{y}=\left(y_{1},\ldots,y_{N}\right)$ , and $\bm{I}$ denoting the identity matrix. In this paper, we restrict ourselves to kernels $k(\cdot,\cdot,\cdot)$ that yield a posterior variance that is monotonically increasing with respect to the hyperparameters $\bm{\theta}$ , as specified in the following assumption.

Assumption 4.1.

The posterior variance $\sigma_{\mathcal{D}_{\text{tr}}}^{2}(\bm{\theta},\bm{x}^{*})$ is a continuous function of $\bm{\theta}$ . Furthermore, for all hyperparameters $\bm{\theta},\bm{\theta}^{\prime}\in\bm{\varTheta}$ with $\bm{\theta}\leq\bm{\theta}^{\prime}$ , it holds that $\sigma_{\mathcal{D}_{\text{tr}}}^{2}(\bm{\theta},\bm{x}^{*})\leq\sigma_{\mathcal{D}_{\text{tr}}}^{2}(\bm{\theta}^{\prime},\bm{x}^{*})$ .

4.1 holds trivially for the signal variance of a kernel. Moreover, it holds for any hyperparameters that lead to a monotonous increase in the Fourier transform, which is the case, e.g., for the inverse lengtschale of stationary kernels up to a multiplicative factor corresponding to the ratio of the lengthscales (Capone et al., 2022). Furthermore, several results that employ the so-called fill-distance indicate that 4.1 holds for the inverse lengthscale of a broad class of stationary kernels as the lengthscale becomes very large or very small (Wendland, 2004). In our experiments, we observed that 4.1 was never violated for the inverse lengthscale of the squared-exponential kernel. 4.1 will be leveraged to define a cumulative density function by also changing the hyperparameters corresponding to the posterior covariance, as opposed to simply scaling it.

Arguably one of the most challenging aspects of GP regression lies in the choice of hyperparameters $\bm{\theta}$ , as they ultimately determine various characteristics of the posterior, e.g., smoothness and amplitude. In practice, the most common way of choosing $\bm{\theta}$ is by maximizing the log marginal likelihood

[TABLE]

In terms of posterior mean quality, i.e., predictive performance of $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta},\bm{x}^{*})$ , choosing the hyperparameters in this manner is often the most promising option, since it seeks a trade-off between model complexity and data fit, and has repeatedly been shown to yield a satisfactory mean square error when applied to test data (Rasmussen & Williams, 2006). However, even when employing log-likelihood maximization, the posterior GP distribution is seldom well calibrated, and, in practice, the data often has a significantly different distribution. As a result, quantiles obtained with a purely Bayesian approach are either too confident or too conservative in practice (Capone et al., 2022; Fong & Holmes, 2020). Furthermore, the restrictions imposed by the GP prior often produce a posterior variance $\sigma^{2}(\bm{\theta},\bm{x}^{*})$ that is grossly conservative. See Figure 1 for an illustration.

5 Proposed Approach

In this section, we present our approach to obtaining calibrated GPs by discarding the posterior variance and obtaining alternative predictive quantiles using a quantity inspired by the posterior variance and new hyperparmeters. As we will demonstrate, our approach has numerous advantages. First, by varying the hyperparameters used to obtain predictive quantiles, the resulting quantiles are sharper than what can be obtained simply by multiplying the posterior variance with an appropriate constant. Secondly, by exploiting the monotonicity of the hyperparameters, our approach can be used to obtain intervals for multiple confidence levels $\delta$ in a very efficient manner. Finally, our method is backed by tight theoretical guarantees, obtained by exploiting its connection to conformal prediction.

5.1 Sharp Calibrated GP for Single Confidence Level $\delta$

In the following, we describe how to obtain a sharply calibrated GP for a fixed desired calibration level $\delta$ . Instead of scaling the GP posterior variance to meet the desired calibration level $\delta$ , we propose computing predictive quantiles by training a new quantity similar to the posterior variance but with new hyperparameters. In other words, we discard the posterior variance of the first GP and replace it by a quantity that corresponds to the posterior variance of a different GP, which we train separately. This way, we allow for more degrees of freedom during calibration. We then leverage this additional freedom to minimize the distance of the quantiles to the predictive mean, which yields a sharply calibrated model.

We assume to have a data set $\mathcal{D}$ , which we split into training data $\mathcal{D}_{\text{tr}}$ and calibration (holdout) data $\mathcal{D}_{\text{cal}}$ , with $\mathcal{D}=\mathcal{D}_{\text{tr}}\cup\mathcal{D}_{\text{cal}}$ . The training data $\mathcal{D}_{\text{tr}}$ will be used to compute the posterior, whereas $\mathcal{D}_{\text{cal}}$ will be used to calibrate the model. Note that while not splitting the data might be reasonable when using other types of regression models, e.g., Bayesian or ensemble neural networks, splitting the data in the case of GPs is beneficial for providing accurate quantiles for data out of distribution. This is because, for many commonly used kernels, the GP posterior distribution is considerably more concentrated for test points close to the training data $\mathcal{D}_{\text{tr}}$ than for those far away from $\mathcal{D}_{\text{tr}}$ . Since we wish to obtain a model that is calibrated for data both close and far away from the training data, we take this into account during training by splitting the data. We also assume to have a predictive mean $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ , corresponding to GP posterior mean function, where the regressor hyperparameters $\bm{\theta}^{R}$ were obtained, e.g., via log-likelihood maximization, as discussed in Section 4. However, note that any other way of choosing the posterior mean hyperparameters $\bm{\theta}^{R}$ is permitted.

Given $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ and $\mathcal{D}=\mathcal{D}_{\text{tr}}\cup\mathcal{D}_{\text{cal}}$ , we then follow the convention of other recalibration approaches (Kuleshov et al., 2018; Marx et al., 2022) and aim to obtain, for an arbitrary $0\leq\delta\leq 1$ , a scalar $\beta_{\delta}$ and vector of hyperparameters $\bm{\theta}_{\delta}$ , such that the corresponding predictive quantile $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)+\beta_{\delta}\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}_{\delta},\cdot)$ contains $\delta$ times the total amount of data points. For this reason, we henceforth refer to $\bm{\theta}_{\delta}$ as calibration hyperparameters.

In order to obtain a sharply calibrated GP model, ideally we would like to choose $\beta_{\delta}$ and $\bm{\theta}_{\delta}$ such that they minimize the expected length of the centered intervals (2) subject to calibration. However, this optimization problem is hard to solve. Hence, we instead attempt to improve model sharpness by concentrating the predictive distribution around the predictive mean $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ , i.e., by minimizing the deviation of the quantiles from $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ . This corresponds to solving the optimization problem

[TABLE]

where $\left\{\bm{x}_{\text{cal}}^{i},y_{\text{cal}}^{i}\right\}\in\mathcal{D}_{\text{cal}}$ are samples from the calibration data set, $\Delta y_{\text{cal}}^{i}\coloneqq y_{\text{cal}}^{i}-\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\bm{x}_{\text{cal}}^{i})$ corresponds to the difference between predicted and measured output, and $N_{\text{cal}}=|\mathcal{D}_{\text{cal}}|$ to the number of data points used for calibration.

*Remark 5.1**.*

Note that we employ $N_{\text{cal}}+1$ in the denominator in (5) instead of $N_{\text{cal}}$ . Though this choice makes little difference in practice, we require it for theoretical guarantees.

The equality constraint in (5) is generally infeasible, e.g., if $N_{\text{cal}}=2$ and $\delta=0.5$ , and is discontinuous, making it hard to solve. As it turns out, this can be easily remedied without any detriment to sharpness or calibration, and the problem can be rendered considerably easier to solve by substituting the equality constraint in (5) with

[TABLE]

where $q_{\text{lin}}(\delta,\bm{\Sigma}_{\mathcal{D}_{\text{tr}}}^{-1}\bm{\Delta y_{\text{cal}}})$ is a monotonically increasing piecewise linear function111 For $\bm{a}\in\mathbb{R}^{N_{\text{cal}}}$ , a permutation $i_{1},\ldots,i_{N_{\text{cal}}}\in[1,...,N_{\text{cal}}]$ where $a_{i_{j}}\leq a_{i_{j+1}}$ for all $j$ , and $\delta\in\left[{l}/(N_{\text{cal}}+1),(l+1)/(N_{\text{cal}}+1)\right]$ ,

$q_{\text{lin}}(\delta,\bm{a})=a_{i_{l}}+\left(\delta(N+1)-l\right)\left(a_{i_{l+1}}-a_{i_{l}}\right).$

that maps $\delta=j/(N_{\text{cal}}+1)$ to the $j$ -th smallest entry of $\bm{\Sigma}_{\mathcal{D}_{\text{tr}}}^{-1}\bm{\Delta y_{\text{cal}}}$ , and the entries of the vector

[TABLE]

correspond to the z-scores of the data under the calibration standard deviation $\sigma_{\mathcal{D}_{\text{tr}}}\left(\bm{\theta}_{\delta},\cdot\right)$ . The original problem (5) then becomes

[TABLE]

which is considerably easier to solve due to the lack of constraints, and enables us to use gradient-based approaches, since $q_{\text{lin}}(\delta,\bm{\Sigma}_{\mathcal{D}_{\text{tr}}}^{-1}\bm{\Delta y_{\text{cal}}})$ is differentiable with respect to $\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}_{\delta},\bm{x}_{\text{cal}}^{i})$ .

*Remark 5.2**.*

The choice of interpolant (6) is due to its simplicity. However, other forms of monotone interpolation are also possible. For high data sizes, the choice of interpolant becomes of little relevance, since we only perform small interpolation steps.

5.2 Calibrated GP for Arbitrary Confidence Level $\delta$

While (7) is useful for obtaining a sharply calibrated model for a single confidence level $\delta$ , solving (7) multiple times whenever we want to obtain sharply calibrated models for different confidence levels $\delta$ can be time-consuming. Furthermore, interpolating between any two arbitrary solutions of (7) won’t necessarily yield a result close to the desired calibration. Fortunately, we can leverage 4.1 to show that interpolating between two solutions of (7) will yield a result close to the desired calibration, provided that we interpolate between two strictly increasing or decreasing sets of hyperparameters. Formally, this is achieved by solving (7) $N_{\text{cal}}+1$ times to obtain $\beta_{0},\beta_{\delta_{1}},\ldots,\beta_{\delta_{N}}\in\mathbb{R}$ and $\bm{\theta}_{0},\bm{\theta}_{\delta_{1}},\ldots,\bm{\theta}_{\delta_{N}}\in\bm{\varTheta}$ , subject to two additional constraints. First, the calibration scaling parameters $\beta_{\delta}$ must be monotonically increasing with $\delta$ , i.e.,

[TABLE]

and the calibration hyperparameters $\bm{\theta}_{\delta}$ must be decreasing with $\delta$ if $\beta_{\delta}$ is negative, and increasing if $\beta_{\delta}$ is positive, i.e.,

[TABLE]

In other words, the entries of $\bm{\theta}_{\delta}$ are strictly decreasing with $\beta_{\delta}$ up until the sign of $\beta_{\delta}$ switches, after which they are increasing. The reason why we impose these restrictions is that we can then confidently interpolate between any values of $\beta_{\delta}$ and $\bm{\theta}_{\delta}$ , since 4.1 implies that the quantile stipulated by $\beta_{\delta}\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta},\cdot)$ is monotonically increasing with $\delta$ . We then train simple piecewise linear interpolation models $\hat{\beta}:[0,1]\rightarrow\mathbb{R}$ and $\hat{\bm{\theta}}:[0,1]\rightarrow\bm{\varTheta}$ , such that $\hat{\beta}(\delta_{i})=\beta_{\delta_{i}}$ and $\hat{\bm{\theta}}(\delta_{i})=\bm{\theta}_{\delta_{i}}$ , with the additional constraint that $\hat{\bm{\theta}}(\delta)$ reaches a minimum whenever $\hat{\beta}(\delta)=0$ , which can be potentially achieved by adding an artificial vector of training hyperparameters $\bm{\theta}_{\delta}$ for computing $\hat{\bm{\theta}}(\delta)$ . Note, however, that any other form of monotone interpolation is also acceptable for obtaining $\hat{\beta}(\delta)$ and $\hat{\bm{\theta}}(\delta)$ . The procedure is summarized in Algorithm 1.

*Remark 5.3**.*

While lengthscale constraints of the form $\bm{\theta}_{\delta_{j}}\leq\bm{\theta}_{\delta_{j+1}}$ can be easily enforced when solving (7) by substituting $\bm{\theta}$ with $\bm{\theta}_{\delta_{j}}+\Delta\bm{\theta}$ and minimizing over the logarithm of $\Delta\bm{\theta}$ , the constraint $\beta_{\delta_{j}}\leq\beta_{\delta_{j+1}}$ is not enforced in (7). However, in practice we were often able to find local minima of (7) that satisfy this requirement.

We can then easily show that our approach achieves an arbitrary calibration level as the amount of data grows, provided that we choose the confidence levels $\delta$ accordingly.

Theorem 5.4.

Let $y$ be absolutely continuous conditioned on $\bm{x}$ , let $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ be a posterior GP mean, and let $\sigma_{\mathcal{D}_{\text{tr}}}(\cdot,\cdot)$ be a GP posterior variance conditioned on $\mathcal{D}_{\text{tr}}$ . Then, for any calibration data set $\mathcal{D}_{\text{cal}}=\left\{\bm{x}_{\text{cal}}^{i},y_{\text{cal}}^{i}\right\}$ , choose

[TABLE]

*and let $\hat{\beta}(\cdot)$ and $\hat{\bm{\theta}}(\cdot)$ be interpolation models obtained with Algorithm 1 and confidence levels $\delta_{1},...,\delta_{N_{\text{cal}}}$ . Then *

[TABLE]

Proof.

The proof can be found in the supplementary material. ∎

Note that Theorem 5.4 also implies that a single set of calibration parameters $\beta_{\delta}$ and $\bm{\theta}_{\delta}$ obtained by solving (7), since we can substitute $\hat{\beta}(\delta)=\beta_{\delta}$ and $\hat{\bm{\theta}}(\delta)=\bm{\theta}_{\delta}$ into (8).

6 Discussion

Computational Complexity.

Much like hyperparameter optimization for standard GPs (Rasmussen & Williams, 2006), the major driver behind the computational complexity in our approach stems from the need to invert the covariance matrix, an operation that scales cubically with the amount of data. In order to alleviate the computational cost of our approach, we can resort to different tools that improve scalability (Liu et al., 2020). One approach consists of employing only a subset of the training data $\mathcal{D}_{\text{tr}}$ to choose the calibration hyperparameters $\bm{\theta}_{\delta}$ , and then the full data set to choose $\beta_{\delta}$ . While this potentially leads to a loss in sharpness compared to when using the full data set, it still guarantees a calibrated model. Our technique is also readily applicable to sparse GPs (Snelson & Ghahramani, 2005; Titsias, 2009), as 4.1 typically still holds. This option is also explored in numerical experiments, in Section 7. Moreover, in many settings a specific level of calibration is often required, as opposed to several different ones, e.g., in stochastic model predictive control, where chance constraints corresponding to a fixed risk have to be satisfied (Mesbah, 2016). In such settings, we potentially only have to train a single vector of calibration hyperparameters $\bm{\theta}_{\delta}$ , which reduces computational cost.

Initialization and Solution.

The choice of initial hyperparameters can affect the optimization results considerably, and choosing a good hyperparameter initialization can be challenging, as is true when choosing the hyperparameters for the predictive mean. While this can be partially addressed by employing random restarts, we can also reuse trained models for similar calibration levels, since it is reasonable to expect that only small changes to the calibration hyperparameters are required to achieve a slight increase or decrease in confidence level. Furthermore, we can also simplify the problem by considering only a scaled version of the regression hyperparameters $\bm{\theta}^{R}$ to compute $\bm{\theta}_{\delta}$ , which would reduce the optimization problem to a line search.

7 Experiments

In this section, we apply and analyze our approach using a toy data set and different regression benchmark data sets from the UCI repository. In the supplementary material, we also compare our approach to that of Capone et al. (2022) when used to obtain uniform error bounds and apply our method to two different Bayesian optimization problems.

The goal is for our approach to obtain a sharp calibrated regression model for each data set in the calibrated regression experiments. We test our approach on various data sets and compare it to the state-of-the-art recalibration approaches by Kuleshov et al. (2018) and Vovk et al. (2020), the point predictor (posterior variance-free) approach proposed in Marx et al. (2022), as well as the check-score-based approach of Kuleshov & Deshpande (2022). The technique proposed by Kuleshov et al. (2018) essentially multiplies the vanilla posterior standard deviation $\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ with the recalibrated z-score, such that the confidence level observed on the calibration data matches that of the desired confidence level. Vovk et al. (2020) employ a similar approach, except that random interpolation is employed to compute new scaling values. The point predictor-based method proposed in Marx et al. (2022) discards the posterior standard deviation $\sigma_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ and computes a constant scalar that is added to the predictive mean and used to compute quantiles everywhere within the input space. The method of Kuleshov & Deshpande (2022) trains a neural network using a quantile loss, which takes base quantiles as inputs and returns new, recalibrated quantiles. As a recalibrator for the method of Kuleshov & Deshpande (2022), we employ the same neural network architecture suggested in their paper, trained over $200$ epochs, with additional pretraining over $2000$ epochs using a single dataset for the UCI experiments. In all experiments except kin8nm and Facebook comment volume 2, we employ standard GPs with automatic relevance determination squared-exponential (ARD-SE) kernels and zero prior mean as base models, trained using log-likelihood maximization. For the kin8nm and Facebook comment volume 2 datasets, we employ sparse GPs (Titsias, 2009) with zero prior mean, ARD-SE kernels, and $300$ inducing points.

7.1 Toy Data Set

The first regression data set corresponds to a one-dimensional synthetic data set, where the results can be easily displayed visually. The main purpose of this section is to give an intuition as to how our approach computes confidence intervals compared to other techniques. We investigate the performance of our approach and compare it to other methods when employed to compute centered $99\%$ confidence intervals. We observe that the confidence intervals obtained with our approach peak less strongly far away from the data while being tight near the data compared to all other approaches except the one of Kuleshov & Deshpande (2022). This is because we allow the lengthscale to change to obtain a calibrated model. In contrast, all other methods except that of Kuleshov & Deshpande (2022) scale the standard GP posterior variance without changing hyperparameters. The method of Kuleshov & Deshpande (2022), which uses a neural network as a recalibrator, offers additional flexibility, resulting in sharper confidence intervals. However, calibration is not explicitly enforced during training, resulting in poor calibration. The results are depicted in Figure 2.

7.2 Benchmark Data Sets

We now experiment with seven different regression data sets from the UCI repository, two containing over eight thousand data points and requiring sparse GP approximations. The training/calibration/test split is $0.6$ , $0.2$ , and $0.2$ for all data sets except the Facebook comment volume 2 data set, which contains over $80000$ data points, and where the split is $0.08$ , $0.02$ , and $0.9$ . For the approach of Kuleshov & Deshpande (2022), we follow the steps in their paper and limit the calibration data size to $500$ .

We assess performance by employing diagnostic tools commonly used to assess calibration and sharpness (Kuleshov et al., 2018; Marx et al., 2022; Gneiting et al., 2007)). The score used to quantify calibration is the calibration error (Kuleshov et al., 2018), given by

[TABLE]

where $p_{j}$ corresponds to the $j$ -th desired confidence level, chosen, e.g., evenly spaced between [math] and $1$ , and $\hat{p}_{j}$ is the observed confidence level, i.e.,

[TABLE]

Here the superscript $*$ denotes test inputs and outputs, $T$ denotes the total number of test points, and $\Delta y_{t}^{*}\coloneqq\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\bm{x}_{t}^{*})-y_{t}^{*}$ . We employ $m=21$ evenly spaced values between [math] and $1$ for ${p}_{j}$ . To measure sharpness, we employ the average length of the $95\%$ confidence interval, the average standard deviation of the predictive distribution, and the average negative log-likelihood of the predictions (Gneiting et al., 2007; Marx et al., 2022). Note that since every model outputs a quantile for any desired calibration level, the corresponding negative log-likelihood and average standard deviation are well specified. These are computed by employing the cumulative distribution function, obtained by inverting the quantile function specified by each model.

We carried out each experiment $100$ times and report the resulting average expected calibration error, standard deviation, negative log-likelihood, and length of the centered 95 $\%$ confidence intervals in Table 1. Our approach performs best or marginally worse than all other calibration approaches regarding expected calibration error. This is to be expected from Theorem 5.4. Furthermore, it outperforms all approaches except that of Kuleshov & Deshpande (2022) in sharpness. However, the improved sharpness of the method of Kuleshov & Deshpande (2022) comes at the expense of calibration.

8 Conclusion

We have presented a calibration method for Gaussian process regression that leverages the monotonicity properties of the kernel hyperparameters to obtain sharp calibrated models. We show that, under reasonable assumptions, our method yields an accurately calibrated model as the size of data used for calibration increases. When applied to different regression benchmark data sets, our approach was shown to be competitive in sharpness compared to state-of-the-art recalibration methods without sacrificing calibration performance. It is worth stressing that, though the tools presented here emerge naturally from a Gaussian process setting, we do not require our predictor to be a Gaussian process to obtain theoretical guarantees. In future work, we aim to leverage similar monotonicity characteristics to get sharply calibrated models using tools different from Gaussian processes. Furthermore, we aim to experiment with inducing variables as hyperparameters when optimizing the models for sharpness.

Acknowledgements

This work was supported in part by the European Research Council Consolidator Grant Safe data-driven control for human-centric systems (CO-MAN) under grant agreement number 864686.

Proof of Theorem 5.4

For completeness, we state Theorem 1 from Marx et al. (2022) here in adapted form, which we then use to prove Theorem 5.4.

Lemma 8.1 (Marx et al. (2022), Theorem 1).

Let $\varphi:\mathcal{X}\times\mathbb{R}\rightarrow\mathbb{R}$ be a function such that that $\varphi(\bm{x},y)$ , $\bm{x},y\sim\Pi$ is an absolutely continuous random variable and, for any fixed $\bm{x}^{*}\in\mathcal{X}$ , $\varphi(\bm{x}^{*},\cdot)$ is strictly monotonically increasing. Furthermore, for a set of calibration data $\mathcal{D}_{\text{cal}}=\left\{\bm{x}_{\text{cal}}^{i},y_{\text{cal}}^{i}\right\}$ with $N_{\text{cal}}=|\mathcal{D}_{\text{cal}}|$ and a permutation $i_{1},\ldots,i_{N_{\text{cal}}}\in[1,2,\ldots,N_{\text{cal}}]$ such that

[TABLE]

let $H:\mathbb{R}\rightarrow[0,1]$ be a monotonically non-decreasing function, such that $H(\varphi(\bm{x}_{\text{cal}}^{i_{j}},y_{\text{cal}}^{i_{j}}))=\frac{j}{N_{\text{cal}}+1}$ holds for all $j=1,\ldots,N_{\text{cal}}$ . Then

[TABLE]

The idea behind the proof of Theorem 5.4 is to show that the solution $\varphi(\bm{x},y)$ of the implicit equation

[TABLE]

satisfies the requirements stipulated by Lemma 8.1, where $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ are arbitrary continuous functions such that

[TABLE]

$\tilde{\bm{\theta}}(\delta)$ is monotonically increasing with respect to $\delta$ for all $\delta\in\{\delta\in\mathbb{R}\ |\ \tilde{\beta}(\delta)>0\}$ , and monotonically decreasing with respect to $\delta$ for all $\delta\in\{\delta\in\mathbb{R}\ |\ \tilde{\beta}(\delta)<0\}$ . Note that the functions $\hat{\beta}(\delta)$ and $\hat{\bm{\theta}}(\delta)$ can be easily extended within the real axis to satisfy the requirements mentioned above, which means that they are contained within the set from which $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ . The reason why we choose arbitrary $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ , as opposed to the functions $\hat{\beta}(\delta)$ and $\hat{\bm{\theta}}(\delta)$ , is because we need $\varphi(\bm{x},y)$ to be independent of the calibration data $\mathcal{D}_{\text{cal}}$ in order to be able to employ Lemma 8.1. Showing that $\varphi(\bm{x},y)$ satisfies the requirements of Lemma 8.1 for any $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ then implies that we can also choose any function within this class that minimizes sharpness, meaning that these properties also extend to $\hat{\beta}(\delta)$ and $\hat{\bm{\theta}}(\delta)$ .

To prove Theorem 5.4, we will require the following result.

Lemma 8.2.

Consider the regressor $\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\cdot)$ , and let $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ be functions that satisfy (11) and (12). Then, for arbitrary fixed $y$ and $\bm{x}$ ,

[TABLE]

is strictly monotonically decreasing with $\delta$ .

Proof. The proof follows directly from 4.1 and the properties (11) and (12). ∎

Proof of Theorem 5.4. Let $\tilde{\beta}(\delta)$ and $\tilde{\bm{\theta}}(\delta)$ be functions that satisfy (11) and (12). Due to Lemma 8.2, we can define the function $\varphi:\mathcal{X}\times\mathbb{R}\rightarrow[0,1]$ as the unique solution to the implicit equation

[TABLE]

Note that, since $y-\mu_{\mathcal{D}_{\text{tr}}}(\bm{\theta}^{R},\bm{x})$ is strictly monotonically increasing with $y$ , $\varphi(\bm{x},y)$ is a strictly monotonically increasing function of $y$ for any fixed $\bm{x}$ . Furthermore, since $y$ is absolutely continuous,

[TABLE]

holds for all $i\neq j$ almost surely, which implies $\varphi(\bm{x}_{\text{cal}}^{i},y_{\text{cal}}^{i})\neq\varphi(\bm{x}_{\text{cal}}^{j},y_{\text{cal}}^{j})$ for all $i\neq j$ almost surely. Hence, $\varphi(\bm{x},y)$ , $\bm{x},y\sim\Pi$ , corresponds to an absolutely continuous random variable. Hence, given any monotonically non-decreasing function $H(\cdot)$ that satisfies the requirement

[TABLE]

Lemma 8.1 implies that

[TABLE]

Since $\tilde{\beta}(\delta)$ and $\hat{\bm{\theta}}\left(\delta\right)$ are arbitrary, and $\hat{\beta}(\delta)$ and $\tilde{\bm{\theta}}\left(\delta\right)$ are continuous and also satisfy (11) and (12) within $\delta\in[0,1]$ , we can substitute $\varphi(\cdot,\cdot)$ in (15) with $\hat{\varphi}(\cdot,\cdot)$ , which is the unique solution of the implicit equation

[TABLE]

Now, in the particular case of $\hat{\varphi}(\cdot,\cdot)$ , due to (7), we have that

[TABLE]

meaning that $H(\hat{\varphi}(\bm{x}_{\text{cal}}^{i_{j}},y_{\text{cal}}^{i_{j}}))=\hat{\varphi}(\bm{x}_{\text{cal}}^{i_{j}},y_{\text{cal}}^{i_{j}})$ , i.e., (15) holds for $\varphi(\cdot,\cdot)=\hat{\varphi}(\cdot,\cdot)$ and the identity function $H(\delta)=\delta$ . Furthermore, since $\hat{\varphi}(\cdot,\cdot)$ is uniquely defined by the implicit equation (16) and $\hat{\beta}(\delta)\sigma_{\mathcal{D}_{\text{tr}}}(\hat{\bm{\theta}}(\delta),\bm{x})$ is monotonically increasing with $\delta$ , this in turn implies

[TABLE]

Since $\tilde{\beta}(\delta)$ and $\hat{\bm{\theta}}\left(\delta\right)$ are arbitrary, and $\hat{\beta}(\delta)$ and $\tilde{\bm{\theta}}\left(\delta\right)$ which, together with (15), implies the desired result. ∎

Comparison with Capone et al. (2022)

In this section, we briefly examine how our approach compares to that of Capone et al. (2022) when used to compute uniform error bounds, i.e., $100$ percent credible intervals, for three different data sets. We carried out each experiment $10$ times. In the following, we report the rate of uniform error bound violation and the average length of the $100$ percent credible intervals. The method of Capone et al. (2022) is purely Bayesian and thus heavily dependent on the prior. The resulting credible intervals are well-calibrated, i.e., they cover most of the data. However, our approach is much better regarding sharpness. This is because Capone et al. (2022) is Bayesian and requires symmetric intervals, whereas our approach is frequentist and allows for asymmetric credible intervals. Our approach also exhibits a lower rate of uniform error bound violations than Capone et al. (2022) in most cases, which suggests that a frequentist approach is more adequate for computing uniform error bounds than a Bayesian one.

Bayesian Optimization

We now investigate how the proposed calibration approach can be employed in a Bayesian optimization context using two commonly used benchmark functions, the Ackley and Rosenbrock functions.

In Bayesian optimization, the goal is to find a point in input space that maximizes an unknown function $f(\cdot)$ . In particular, we investigate how our calibrated GP bound performs when used as an upper confidence bound (UCB) for a GP-UCB type acquisition function. Simply put, given a data set ${\mathcal{D}_{t}}$ of size $t$ , the GP-UCB algorithm chooses a query point by maximizing the acquisition function

[TABLE]

where $\beta_{\mathcal{D}_{t}}$ is a tuning parameter that stipulates the trade-off between exploration and exploitation, and may or may not depend on the data set $\mathcal{D}_{t}$ . It has been shown that if the unknown function $f(\cdot)$ belongs to the RKHS associated with the kernel $k(\bm{\theta}^{R},\cdot,\cdot)$ , and $\beta_{\mathcal{D}_{t}}$ is chosen sufficiently large, then the GP-UCB achieves sublinear regret (Chowdhury & Gopalan, 2017). However, both assumptions typically cannot be verified in practice, and choosing both the kernel $k(\bm{\theta}^{R},\cdot,\cdot)$ and the scaling factor $\beta_{\mathcal{D}_{t}}$ in a principled manner remains an open problem. We propose employing the modified acquisition function

[TABLE]

where the hyperparameters $\bm{\theta}_{\delta}$ are obtained via a calibrated model and a suitable choice of confidence parameter $\delta$ . In the experiments, we set $\beta_{\mathcal{D}_{t}}=1$ and compute the calibrated hyperparameters by setting $\delta=0.01$ , meaning that we set expect only one percent of the evaluations to lie outside the confidence region. Note that even though the underlying function is fixed, it is reasonable to expect that some of the data lies outside the confidence region due to noise, and we can only expect the data to lie fully within the confidence region in the noiseless case, which we do not consider in this paper. Furthermore, we refrain from retraining the hyperparameters after each data point is collected, following the convention of other Bayesian optimization approaches (Srinivas et al., 2012; Chowdhury & Gopalan, 2017). While this does not enable us to employ the theoretical guarantees developed in Section 5, it reduces computational time significantly. We additionally compare our results to the vanilla UCB algorithm, where the hyperparameters, chosen via log-likelihood maximization, are identical for both the posterior mean and variance, and we set $\beta_{\mathcal{D}_{t}}=2$ .

We evaluate the results both in terms of cumulative regret and simple regret. Cumulative regret after $T$ steps corresponds to the metric

[TABLE]

whereas simple regret is given by

[TABLE]

Typically, a Bayesian optimization algorithm is deemed useful if cumulative regret exhibits sublinear growth, implying that the average regret goes to zero. Simple regret, by contrast, corresponds to the best query among all past queries and is an important metric whenever evaluation costs are low (Berkenkamp et al., 2019).

In the case of the Ackley experiment, our approach typically chose lengthscales that were smaller than those computed via likelihood maximization. This results in more exhaustive exploration than vanilla UCB, which in turn means that local minima are explored more carefully before the focus of the optimization is shifted elsewhere. This results in better performance than when using vanilla UCB, both in terms of cumulative and simple regret. The results correspond to the top two figures in Figure 3.

In contrast to the Ackley experiment, in the Rosenbrock experiment our approach selects lengthscales that are larger than those suggested by the likelihood maximum. Roughly speaking, this means that the confidence intervals produced by the likelihood maximum hyperparameters are too conservative, and our approach attempts to compensate for this by indicating more confidence in the posterior mean obtained with the vanilla GP. This means that local minima are explored less meticulously than with the vanilla UCB algorithm. This choice is justified by the cumulative regret obtained with our approach, as it is slightly smaller than that obtained by the vanilla UCB algorithm. However, this also results in worse simple regret than the vanilla UCB algorithm, which is intuitive, as our approach opts to explore local minima less accurately than the vanilla UCB algorithm. We also note that both algorithms converge towards the same simple regret as the number of iterations increases. The results correspond to the bottom two figures in Figure 3.

Bibliography28

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Berkenkamp et al. (2019) Berkenkamp, F., Schoellig, A. P., and Krause, A. No-regret bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research , 20(50):1–24, 2019. URL http://jmlr.org/papers/v 20/18-213.html .
2Capone et al. (2022) Capone, A., Lederer, A., and Hirche, S. Gaussian process uniform error bounds with unknown hyperparameters for safety-critical applications. In Proceedings of the 39th International Conference on Machine Learning , volume 162 of Proceedings of Machine Learning Research , pp. 2609–2624. PMLR, 2022. URL https://proceedings.mlr.press/v 162/capone 22a.html .
3Chowdhury & Gopalan (2017) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning , volume 70 of Proceedings of Machine Learning Research , pp. 844–853, International Convention Centre, Sydney, Australia, 2017. PMLR. URL http://proceedings.mlr.press/v 70/chowdhury 17a.html .
4Deisenroth et al. (2015) Deisenroth, M. P., Fox, D., and Rasmussen, C. E. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence , 37(2):408–423, 2015.
5Deringer et al. (2021) Deringer, V. L., Bartók, A. P., Bernstein, N., Wilkins, D. M., Ceriotti, M., and Csányi, G. Gaussian process regression for materials and molecules. Chemical Reviews , 121(16):10073–10141, 2021. doi: 10.1021/acs.chemrev.1c 00022 . URL https://doi.org/10.1021/acs.chemrev.1c 00022 . PMID: 34398616. · doi ↗
6Fong & Holmes (2020) Fong, E. and Holmes, C. C. On the marginal likelihood and cross-validation. Biometrika , 107(2):489–496, 2020. doi: 10.1093/biomet/asz 077 . URL https://doi.org/10.1093/biomet/asz 077 . · doi ↗
7Gal et al. (2017) Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/84ddfb 34126 fc 3a 48ee 38d 7044 e 87276-Paper.pdf .
8Gneiting et al. (2007) Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 69(2):243–268, 2007.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Sharp Calibrated Gaussian Processes

Abstract

1 Introduction

Notation.

2 Related Work

Calibration of Classification Models.

Calibration of Regression Models.

3 Problem Statement

4 Gaussian Process Regression

Assumption 4.1**.**

5 Proposed Approach

5.1 Sharp Calibrated GP for Single Confidence Level δ\deltaδ

Remark 5.1*.*

Remark 5.2*.*

5.2 Calibrated GP for Arbitrary Confidence Level δ\deltaδ

Remark 5.3*.*

Theorem 5.4**.**

Proof.

6 Discussion

Computational Complexity.

Initialization and Solution.

7 Experiments

7.1 Toy Data Set

7.2 Benchmark Data Sets

8 Conclusion

Acknowledgements

Proof of Theorem 5.4

Lemma 8.1** (Marx et al. (2022), Theorem 1).**

Lemma 8.2**.**

Comparison with Capone et al. (2022)

Bayesian Optimization

Assumption 4.1.

5.1 Sharp Calibrated GP for Single Confidence Level $\delta$

*Remark 5.1**.*

*Remark 5.2**.*

5.2 Calibrated GP for Arbitrary Confidence Level $\delta$

*Remark 5.3**.*

Theorem 5.4.

Lemma 8.1 (Marx et al. (2022), Theorem 1).

Lemma 8.2.