Sharp Calibrated Gaussian Processes
Alexandre Capone, Geoff Pleiss, Sandra Hirche

TL;DR
This paper introduces a flexible calibration method for Gaussian processes that improves the accuracy of predictive quantiles, ensuring better frequentist calibration and sharper confidence intervals in regression tasks.
Contribution
A novel calibration approach for Gaussian processes that uses hyperparameters optimized for empirical calibration, enhancing the sharpness and reliability of predictive intervals.
Findings
Outperforms existing calibration methods in sharpness.
Yields well-calibrated predictive quantiles.
Applicable under reasonable assumptions.
Abstract
While Gaussian processes are a mainstay for various engineering and scientific applications, the uncertainty estimates don't satisfy frequentist guarantees and can be miscalibrated in practice. State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse. To remedy this, we present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but using a different set of hyperparameters chosen to satisfy an empirical calibration constraint. This results in a calibration approach that is considerably more flexible than existing approaches, which we optimize to yield tight predictive quantiles. Our approach is shown to yield a calibrated model under reasonable assumptions. Furthermore, it…
| Data set | Metric | Ours | RK | RV | RM | NN | B |
|---|---|---|---|---|---|---|---|
| ECE | 0.003 | 0.0029 | 0.0029 | 0.0029 | 0.0056 | 0.041 | |
| STD | 0.16 | 0.31 | 0.3 | 0.33 | 0.22 | 1.9 | |
| Boston | NLL | 0.21 | 0.39 | 0.4 | 0.42 | -0.24 | 1.6 |
| 95 CI | 0.76 | 1.4 | 1.4 | 1.4 | 0.73 | 7.4 | |
| ECE | 0.0044 | 0.0043 | 0.0044 | 0.0043 | 0.0081 | 0.039 | |
| STD | 0.16 | 0.5 | 0.47 | 0.5 | 0.14 | 2.8 | |
| yacht | NLL | 0.26 | 0.68 | 0.69 | 0.68 | -2 | 2 |
| 95 CI | 0.76 | 2.3 | 2.3 | 2.3 | 0.3 | 11 | |
| ECE | 0.0036 | 0.0035 | 0.0035 | 0.0035 | 0.0053 | 0.044 | |
| STD | 0.13 | 0.38 | 0.38 | 0.38 | 0.29 | 2.8 | |
| mpg | NLL | 0.032 | 0.63 | 0.64 | 0.63 | 0.02 | 2 |
| 95 CI | 0.6 | 1.7 | 1.8 | 1.7 | 0.96 | 11 | |
| ECE | 0.00047 | 0.00047 | 0.00047 | 0.00047 | 0.0067 | 0.0058 | |
| STD | 0.54 | 1 | 1 | 0.88 | 0.72 | 1.4 | |
| wine | NLL | 1.2 | 1.3 | 1.3 | 1.3 | -0.36 | 1.4 |
| 95 CI | 2.1 | 3.8 | 3.8 | 3.9 | 2.8 | 5.4 | |
| ECE | 0.00071 | 0.00064 | 0.00064 | 0.00064 | 0.00076 | 0.032 | |
| STD | 0.25 | 0.64 | 0.64 | 0.64 | 0.57 | 2.8 | |
| concrete | NLL | 0.72 | 1.1 | 1.1 | 1.1 | 0.85 | 2 |
| 95 CI | 0.93 | 2.5 | 2.5 | 2.5 | 2.1 | 11 | |
| ECE | 0.00016 | 0.00016 | 0.00016 | 0.00016 | 0.00053 | 0.028 | |
| STD | 0.074 | 0.12 | 0.12 | 0.12 | 0.098 | 0.4 | |
| kin8nm | NLL | -0.54 | -0.65 | -0.65 | -0.63 | -0.76 | 0.1 |
| 95 CI | 0.26 | 0.47 | 0.47 | 0.48 | 0.44 | 1.6 | |
| ECE | 0.00044 | 0.00043 | 0.00043 | 0.00045 | 0.0089 | 0.044 | |
| STD | 0.068 | 0.18 | 0.18 | 0.18 | 0.18 | 1.2 | |
| Facebook2 | NLL | 3.6 | -1.3 | -1.3 | -1.2 | -2.3 | 1.2 |
| 95 CI | 0.6 | 1.7 | 1.7 | 1.7 | 3.4 | 4.6 |
| Data set | Metric | Ours | Capone et al. (2022) |
|---|---|---|---|
| Rate of uniform error bound violation | 0.00376 | 0.000172 | |
| Boston | Length of 100 CI | 1.2 | 28.3 |
| Rate of uniform error bound violation | 0.004 | 0.0065 | |
| mpg | Length of 100 CI | 1.7 | 24.13 |
| Rate of uniform error bound violation | 0.00072 | 0.00096 | |
| wine | Length of 100 CI | 4.7 | 24.8 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGaussian Processes and Bayesian Inference · Fault Detection and Control Systems · Advanced Control Systems Optimization
MethodsGaussian Process
Sharp Calibrated Gaussian Processes
Alexandre Capone
Technical University of Munich
&Sandra Hirche
Technical University of Munich
Geoff Pleiss
University of British Columbia
Vector Institute
Abstract
While Gaussian processes are a mainstay for various engineering and scientific applications, the uncertainty estimates don’t satisfy frequentist guarantees and can be miscalibrated in practice. State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance, which yields confidence intervals that are potentially too coarse. To remedy this, we present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance but using a different set of hyperparameters chosen to satisfy an empirical calibration constraint. This results in a calibration approach that is considerably more flexible than existing approaches, which we optimize to yield tight predictive quantiles. Our approach is shown to yield a calibrated model under reasonable assumptions. Furthermore, it outperforms existing approaches in sharpness when employed for calibrated regression.
1 Introduction
Gaussian process (GP) regression offers an ambitious proposition: by conditioning a model on measurement data, we are provided with a Gaussian probability distribution for the unseen data. Assuming that the posterior probability distribution holds, we can then directly calibrate our model using the inverse error function. Though the distribution of unseen data seldom follows the Gaussian prior distribution, and the GP generally does not adapt adequately to the observed distributions after being conditioned on the data, GPs have become one of the most powerful and established regression techniques. Besides having found widespread use in machine learning (Deisenroth et al., 2015; Srinivas et al., 2012), their good generalization properties have motivated applications in the fields of control (Kocijan, 2016), astrophysics (Roberts et al., 2013) and chemistry (Deringer et al., 2021), to name a few. Furthermore, the Bayesian paradigm offers a powerful tool to analyze the theoretical properties of different regression techniques (Srinivas et al., 2012; Capone et al., 2022).
In this paper, we present a novel approach to obtaining sharp calibrated Gaussian processes, i.e., Gaussian processes that provide concentrated predictive distributions that accurately match the observed data. Instead of computing confidence intervals by inflating the Gaussian process posterior variance, our approach discards it and computes a new quantity inspired by the computation of the posterior variance, where all hyperparameters are chosen in a way that results in both accurate and sharp calibration. In other words, we train two separate Gaussian processes: one for the predictive mean and one for obtaining predictive quantiles, which is exclusively used for calibration purposes. By doing so, we reach considerably more flexibility than existing calibration approaches, which enables us to additionally optimize the sharpness of the calibration. Our approach outperforms several state-of-the-art calibration approaches in terms of sharpness while still yielding similar calibration performance. Furthermore, it is competitive compared to a neural network-based method in sharpness without sacrificing calibration performance.
Notation.
We use to denote the non-negative real numbers. Boldface lowercase/uppercase characters denote vectors/matrices. For two vectors and in , we employ the notation to denote componentwise inequality, i.e., , . For a square matrix , we use to denote its determinant, and to denote the entry corresponding to the -th row and -th column.
2 Related Work
Calibration of Classification Models.
There has been extensive work on obtaining calibrated models in the domain of classification. While there are many methods that do not employ post-processing, we only focus here on methods that employ some form of post-processing. Most forms of post-processing-based calibration for classification fall into the category of conformal methods (Vovk et al., 2005), which, given an input, aim to produce sets of labels that contain the true label with a pre-specified probability. Arguably the two most common forms of calibration are isotonic regression (Niculescu-Mizil & Caruana, 2005) and Platt scaling (Platt et al., 1999). In Niculescu-Mizil & Caruana (2005), Platt scaling and isotonic regression are analyzed extensively for different types of predictive models. In Guo et al. (2017), a modified form of Platt scaling for modern classification neural networks is proposed.
Calibration of Regression Models.
Though initially developed for classification, conformal calibration has been extended to regression settings. In Lakshminarayanan et al. (2017), a calibration approach was proposed for deep ensembles. Gal et al. (2017) propose a dropout-based technique for calibrating deep neural networks. However, these approaches require changing the regressor, potentially deteriorating its predictive performance. It should be noted that Bayesian neural networks (MacKay, 1995), while being able to provide credible sets for the output, fully trust the posterior, resulting in a naively calibrated model that seldom reflects the data’s distribution. As a remedy for this, Kuleshov et al. (2018) present a recalibration approach that scales a model’s predictive quantiles to satisfy the observed data’s distribution. In Vovk et al. (2020), a similar approach is presented, where interpolation between scaling factors is randomized, and a theoretical analysis is provided. An extension of both Kuleshov et al. (2018) and Vovk et al. (2020) and other recalibration is proposed in Marx et al. (2022), along with corresponding theoretical guarantees. While these methods have been shown to yield well-calibrated models, the resulting predictive quantiles are potentially much too crude, resulting in predictions that perform poorly in terms of sharpness, i.e., the corresponding confidence intervals will overestimate the model error by a very large margin. To remedy this, Song et al. (2019) and Kuleshov & Deshpande (2022) propose optimizing the parameters of a recalibration model by obtaining calibration on a distribution level. However, while Song et al. (2019) relies on complex approximations and provides no theoretical guarantees, Kuleshov & Deshpande (2022) does not allow to optimize for sharpness directly, and calibration is only guaranteed asymptotically as the number of data grows.
3 Problem Statement
Consider a compact input space , and output space , and an unknown data distribution on . Consider a model conditioned on training data that, for every and confidence level , returns a base prediction and an additive cut-point term , where and is potentially negative, such that corresponds to a predictive -quantile. The model is then said to be calibrated if
[TABLE]
holds for every . Furthermore, the calibrated model is also said to be sharp if the corresponding predictive distributions are concentrated (Gneiting et al., 2007), i.e., if the centered confidence intervals induced by the predictive quantiles
[TABLE]
are as small as possible for every . Our goal is to find a sharply calibrated model based on GP regression.
4 Gaussian Process Regression
In this section, we briefly review GP regression, with particular focus on the choice and influence of hyperparameters.
A GP is formally defined as a collection of random variables, any finite subset of which is jointly Gaussian (Rasmussen & Williams, 2006). It is fully specified by a prior mean function, which we set to zero without loss of generality, and a hyperparameter-dependent covariance function, called kernel , where denotes the hyperparameter space. The core concept behind GP regression lies in assuming that any finite number of measurements of an unknown function at arbitrary inputs are jointly Gaussian with mean zero and covariance , where
[TABLE]
consists of kernel evaluations at the test inputs.
Given a set of noisy observations , where and is iid Gaussian measurement noise, we can condition the GP on them to obtain the posterior distribution. The posterior distribution for a new observation at an arbitrary input is again Gaussian distributed, with mean and variance
[TABLE]
with , , and denoting the identity matrix. In this paper, we restrict ourselves to kernels that yield a posterior variance that is monotonically increasing with respect to the hyperparameters , as specified in the following assumption.
Assumption 4.1**.**
The posterior variance is a continuous function of . Furthermore, for all hyperparameters with , it holds that .
4.1 holds trivially for the signal variance of a kernel. Moreover, it holds for any hyperparameters that lead to a monotonous increase in the Fourier transform, which is the case, e.g., for the inverse lengtschale of stationary kernels up to a multiplicative factor corresponding to the ratio of the lengthscales (Capone et al., 2022). Furthermore, several results that employ the so-called fill-distance indicate that 4.1 holds for the inverse lengthscale of a broad class of stationary kernels as the lengthscale becomes very large or very small (Wendland, 2004). In our experiments, we observed that 4.1 was never violated for the inverse lengthscale of the squared-exponential kernel. 4.1 will be leveraged to define a cumulative density function by also changing the hyperparameters corresponding to the posterior covariance, as opposed to simply scaling it.
Arguably one of the most challenging aspects of GP regression lies in the choice of hyperparameters , as they ultimately determine various characteristics of the posterior, e.g., smoothness and amplitude. In practice, the most common way of choosing is by maximizing the log marginal likelihood
[TABLE]
In terms of posterior mean quality, i.e., predictive performance of , choosing the hyperparameters in this manner is often the most promising option, since it seeks a trade-off between model complexity and data fit, and has repeatedly been shown to yield a satisfactory mean square error when applied to test data (Rasmussen & Williams, 2006). However, even when employing log-likelihood maximization, the posterior GP distribution is seldom well calibrated, and, in practice, the data often has a significantly different distribution. As a result, quantiles obtained with a purely Bayesian approach are either too confident or too conservative in practice (Capone et al., 2022; Fong & Holmes, 2020). Furthermore, the restrictions imposed by the GP prior often produce a posterior variance that is grossly conservative. See Figure 1 for an illustration.
5 Proposed Approach
In this section, we present our approach to obtaining calibrated GPs by discarding the posterior variance and obtaining alternative predictive quantiles using a quantity inspired by the posterior variance and new hyperparmeters. As we will demonstrate, our approach has numerous advantages. First, by varying the hyperparameters used to obtain predictive quantiles, the resulting quantiles are sharper than what can be obtained simply by multiplying the posterior variance with an appropriate constant. Secondly, by exploiting the monotonicity of the hyperparameters, our approach can be used to obtain intervals for multiple confidence levels in a very efficient manner. Finally, our method is backed by tight theoretical guarantees, obtained by exploiting its connection to conformal prediction.
5.1 Sharp Calibrated GP for Single Confidence Level
In the following, we describe how to obtain a sharply calibrated GP for a fixed desired calibration level . Instead of scaling the GP posterior variance to meet the desired calibration level , we propose computing predictive quantiles by training a new quantity similar to the posterior variance but with new hyperparameters. In other words, we discard the posterior variance of the first GP and replace it by a quantity that corresponds to the posterior variance of a different GP, which we train separately. This way, we allow for more degrees of freedom during calibration. We then leverage this additional freedom to minimize the distance of the quantiles to the predictive mean, which yields a sharply calibrated model.
We assume to have a data set , which we split into training data and calibration (holdout) data , with . The training data will be used to compute the posterior, whereas will be used to calibrate the model. Note that while not splitting the data might be reasonable when using other types of regression models, e.g., Bayesian or ensemble neural networks, splitting the data in the case of GPs is beneficial for providing accurate quantiles for data out of distribution. This is because, for many commonly used kernels, the GP posterior distribution is considerably more concentrated for test points close to the training data than for those far away from . Since we wish to obtain a model that is calibrated for data both close and far away from the training data, we take this into account during training by splitting the data. We also assume to have a predictive mean , corresponding to GP posterior mean function, where the regressor hyperparameters were obtained, e.g., via log-likelihood maximization, as discussed in Section 4. However, note that any other way of choosing the posterior mean hyperparameters is permitted.
Given and , we then follow the convention of other recalibration approaches (Kuleshov et al., 2018; Marx et al., 2022) and aim to obtain, for an arbitrary , a scalar and vector of hyperparameters , such that the corresponding predictive quantile contains times the total amount of data points. For this reason, we henceforth refer to as calibration hyperparameters.
In order to obtain a sharply calibrated GP model, ideally we would like to choose and such that they minimize the expected length of the centered intervals (2) subject to calibration. However, this optimization problem is hard to solve. Hence, we instead attempt to improve model sharpness by concentrating the predictive distribution around the predictive mean , i.e., by minimizing the deviation of the quantiles from . This corresponds to solving the optimization problem
[TABLE]
where are samples from the calibration data set, corresponds to the difference between predicted and measured output, and to the number of data points used for calibration.
Remark 5.1*.*
Note that we employ in the denominator in (5) instead of . Though this choice makes little difference in practice, we require it for theoretical guarantees.
The equality constraint in (5) is generally infeasible, e.g., if and , and is discontinuous, making it hard to solve. As it turns out, this can be easily remedied without any detriment to sharpness or calibration, and the problem can be rendered considerably easier to solve by substituting the equality constraint in (5) with
[TABLE]
where is a monotonically increasing piecewise linear function111 For , a permutation where for all , and ,
that maps to the -th smallest entry of , and the entries of the vector
[TABLE]
correspond to the z-scores of the data under the calibration standard deviation . The original problem (5) then becomes
[TABLE]
which is considerably easier to solve due to the lack of constraints, and enables us to use gradient-based approaches, since is differentiable with respect to .
Remark 5.2*.*
The choice of interpolant (6) is due to its simplicity. However, other forms of monotone interpolation are also possible. For high data sizes, the choice of interpolant becomes of little relevance, since we only perform small interpolation steps.
5.2 Calibrated GP for Arbitrary Confidence Level
While (7) is useful for obtaining a sharply calibrated model for a single confidence level , solving (7) multiple times whenever we want to obtain sharply calibrated models for different confidence levels can be time-consuming. Furthermore, interpolating between any two arbitrary solutions of (7) won’t necessarily yield a result close to the desired calibration. Fortunately, we can leverage 4.1 to show that interpolating between two solutions of (7) will yield a result close to the desired calibration, provided that we interpolate between two strictly increasing or decreasing sets of hyperparameters. Formally, this is achieved by solving (7) times to obtain and , subject to two additional constraints. First, the calibration scaling parameters must be monotonically increasing with , i.e.,
[TABLE]
and the calibration hyperparameters must be decreasing with if is negative, and increasing if is positive, i.e.,
[TABLE]
[TABLE]
In other words, the entries of are strictly decreasing with up until the sign of switches, after which they are increasing. The reason why we impose these restrictions is that we can then confidently interpolate between any values of and , since 4.1 implies that the quantile stipulated by is monotonically increasing with . We then train simple piecewise linear interpolation models and , such that and , with the additional constraint that reaches a minimum whenever , which can be potentially achieved by adding an artificial vector of training hyperparameters for computing . Note, however, that any other form of monotone interpolation is also acceptable for obtaining and . The procedure is summarized in Algorithm 1.
Remark 5.3*.*
While lengthscale constraints of the form can be easily enforced when solving (7) by substituting with and minimizing over the logarithm of , the constraint is not enforced in (7). However, in practice we were often able to find local minima of (7) that satisfy this requirement.
We can then easily show that our approach achieves an arbitrary calibration level as the amount of data grows, provided that we choose the confidence levels accordingly.
Theorem 5.4**.**
Let be absolutely continuous conditioned on , let be a posterior GP mean, and let be a GP posterior variance conditioned on . Then, for any calibration data set , choose
[TABLE]
*and let and be interpolation models obtained with Algorithm 1 and confidence levels . Then *
[TABLE]
Proof.
The proof can be found in the supplementary material. ∎
Note that Theorem 5.4 also implies that a single set of calibration parameters and obtained by solving (7), since we can substitute and into (8).
6 Discussion
Computational Complexity.
Much like hyperparameter optimization for standard GPs (Rasmussen & Williams, 2006), the major driver behind the computational complexity in our approach stems from the need to invert the covariance matrix, an operation that scales cubically with the amount of data. In order to alleviate the computational cost of our approach, we can resort to different tools that improve scalability (Liu et al., 2020). One approach consists of employing only a subset of the training data to choose the calibration hyperparameters , and then the full data set to choose . While this potentially leads to a loss in sharpness compared to when using the full data set, it still guarantees a calibrated model. Our technique is also readily applicable to sparse GPs (Snelson & Ghahramani, 2005; Titsias, 2009), as 4.1 typically still holds. This option is also explored in numerical experiments, in Section 7. Moreover, in many settings a specific level of calibration is often required, as opposed to several different ones, e.g., in stochastic model predictive control, where chance constraints corresponding to a fixed risk have to be satisfied (Mesbah, 2016). In such settings, we potentially only have to train a single vector of calibration hyperparameters , which reduces computational cost.
Initialization and Solution.
The choice of initial hyperparameters can affect the optimization results considerably, and choosing a good hyperparameter initialization can be challenging, as is true when choosing the hyperparameters for the predictive mean. While this can be partially addressed by employing random restarts, we can also reuse trained models for similar calibration levels, since it is reasonable to expect that only small changes to the calibration hyperparameters are required to achieve a slight increase or decrease in confidence level. Furthermore, we can also simplify the problem by considering only a scaled version of the regression hyperparameters to compute , which would reduce the optimization problem to a line search.
7 Experiments
In this section, we apply and analyze our approach using a toy data set and different regression benchmark data sets from the UCI repository. In the supplementary material, we also compare our approach to that of Capone et al. (2022) when used to obtain uniform error bounds and apply our method to two different Bayesian optimization problems.
The goal is for our approach to obtain a sharp calibrated regression model for each data set in the calibrated regression experiments. We test our approach on various data sets and compare it to the state-of-the-art recalibration approaches by Kuleshov et al. (2018) and Vovk et al. (2020), the point predictor (posterior variance-free) approach proposed in Marx et al. (2022), as well as the check-score-based approach of Kuleshov & Deshpande (2022). The technique proposed by Kuleshov et al. (2018) essentially multiplies the vanilla posterior standard deviation with the recalibrated z-score, such that the confidence level observed on the calibration data matches that of the desired confidence level. Vovk et al. (2020) employ a similar approach, except that random interpolation is employed to compute new scaling values. The point predictor-based method proposed in Marx et al. (2022) discards the posterior standard deviation and computes a constant scalar that is added to the predictive mean and used to compute quantiles everywhere within the input space. The method of Kuleshov & Deshpande (2022) trains a neural network using a quantile loss, which takes base quantiles as inputs and returns new, recalibrated quantiles. As a recalibrator for the method of Kuleshov & Deshpande (2022), we employ the same neural network architecture suggested in their paper, trained over epochs, with additional pretraining over epochs using a single dataset for the UCI experiments. In all experiments except kin8nm and Facebook comment volume 2, we employ standard GPs with automatic relevance determination squared-exponential (ARD-SE) kernels and zero prior mean as base models, trained using log-likelihood maximization. For the kin8nm and Facebook comment volume 2 datasets, we employ sparse GPs (Titsias, 2009) with zero prior mean, ARD-SE kernels, and inducing points.
7.1 Toy Data Set
The first regression data set corresponds to a one-dimensional synthetic data set, where the results can be easily displayed visually. The main purpose of this section is to give an intuition as to how our approach computes confidence intervals compared to other techniques. We investigate the performance of our approach and compare it to other methods when employed to compute centered confidence intervals. We observe that the confidence intervals obtained with our approach peak less strongly far away from the data while being tight near the data compared to all other approaches except the one of Kuleshov & Deshpande (2022). This is because we allow the lengthscale to change to obtain a calibrated model. In contrast, all other methods except that of Kuleshov & Deshpande (2022) scale the standard GP posterior variance without changing hyperparameters. The method of Kuleshov & Deshpande (2022), which uses a neural network as a recalibrator, offers additional flexibility, resulting in sharper confidence intervals. However, calibration is not explicitly enforced during training, resulting in poor calibration. The results are depicted in Figure 2.
7.2 Benchmark Data Sets
We now experiment with seven different regression data sets from the UCI repository, two containing over eight thousand data points and requiring sparse GP approximations. The training/calibration/test split is , , and for all data sets except the Facebook comment volume 2 data set, which contains over data points, and where the split is , , and . For the approach of Kuleshov & Deshpande (2022), we follow the steps in their paper and limit the calibration data size to .
We assess performance by employing diagnostic tools commonly used to assess calibration and sharpness (Kuleshov et al., 2018; Marx et al., 2022; Gneiting et al., 2007)). The score used to quantify calibration is the calibration error (Kuleshov et al., 2018), given by
[TABLE]
where corresponds to the -th desired confidence level, chosen, e.g., evenly spaced between [math] and , and is the observed confidence level, i.e.,
[TABLE]
Here the superscript denotes test inputs and outputs, denotes the total number of test points, and . We employ evenly spaced values between [math] and for . To measure sharpness, we employ the average length of the confidence interval, the average standard deviation of the predictive distribution, and the average negative log-likelihood of the predictions (Gneiting et al., 2007; Marx et al., 2022). Note that since every model outputs a quantile for any desired calibration level, the corresponding negative log-likelihood and average standard deviation are well specified. These are computed by employing the cumulative distribution function, obtained by inverting the quantile function specified by each model.
We carried out each experiment times and report the resulting average expected calibration error, standard deviation, negative log-likelihood, and length of the centered 95 confidence intervals in Table 1. Our approach performs best or marginally worse than all other calibration approaches regarding expected calibration error. This is to be expected from Theorem 5.4. Furthermore, it outperforms all approaches except that of Kuleshov & Deshpande (2022) in sharpness. However, the improved sharpness of the method of Kuleshov & Deshpande (2022) comes at the expense of calibration.
8 Conclusion
We have presented a calibration method for Gaussian process regression that leverages the monotonicity properties of the kernel hyperparameters to obtain sharp calibrated models. We show that, under reasonable assumptions, our method yields an accurately calibrated model as the size of data used for calibration increases. When applied to different regression benchmark data sets, our approach was shown to be competitive in sharpness compared to state-of-the-art recalibration methods without sacrificing calibration performance. It is worth stressing that, though the tools presented here emerge naturally from a Gaussian process setting, we do not require our predictor to be a Gaussian process to obtain theoretical guarantees. In future work, we aim to leverage similar monotonicity characteristics to get sharply calibrated models using tools different from Gaussian processes. Furthermore, we aim to experiment with inducing variables as hyperparameters when optimizing the models for sharpness.
Acknowledgements
This work was supported in part by the European Research Council Consolidator Grant Safe data-driven control for human-centric systems (CO-MAN) under grant agreement number 864686.
Proof of Theorem 5.4
For completeness, we state Theorem 1 from Marx et al. (2022) here in adapted form, which we then use to prove Theorem 5.4.
Lemma 8.1** (Marx et al. (2022), Theorem 1).**
Let be a function such that that , is an absolutely continuous random variable and, for any fixed , is strictly monotonically increasing. Furthermore, for a set of calibration data with and a permutation such that
[TABLE]
let be a monotonically non-decreasing function, such that holds for all . Then
[TABLE]
The idea behind the proof of Theorem 5.4 is to show that the solution of the implicit equation
[TABLE]
satisfies the requirements stipulated by Lemma 8.1, where and are arbitrary continuous functions such that
[TABLE]
[TABLE]
is monotonically increasing with respect to for all , and monotonically decreasing with respect to for all . Note that the functions and can be easily extended within the real axis to satisfy the requirements mentioned above, which means that they are contained within the set from which and . The reason why we choose arbitrary and , as opposed to the functions and , is because we need to be independent of the calibration data in order to be able to employ Lemma 8.1. Showing that satisfies the requirements of Lemma 8.1 for any and then implies that we can also choose any function within this class that minimizes sharpness, meaning that these properties also extend to and .
To prove Theorem 5.4, we will require the following result.
Lemma 8.2**.**
Consider the regressor , and let and be functions that satisfy (11) and (12). Then, for arbitrary fixed and ,
[TABLE]
is strictly monotonically decreasing with .
Proof. The proof follows directly from 4.1 and the properties (11) and (12). ∎
Proof of Theorem 5.4. Let and be functions that satisfy (11) and (12). Due to Lemma 8.2, we can define the function as the unique solution to the implicit equation
[TABLE]
Note that, since is strictly monotonically increasing with , is a strictly monotonically increasing function of for any fixed . Furthermore, since is absolutely continuous,
[TABLE]
holds for all almost surely, which implies for all almost surely. Hence, , , corresponds to an absolutely continuous random variable. Hence, given any monotonically non-decreasing function that satisfies the requirement
[TABLE]
Lemma 8.1 implies that
[TABLE]
Since and are arbitrary, and and are continuous and also satisfy (11) and (12) within , we can substitute in (15) with , which is the unique solution of the implicit equation
[TABLE]
Now, in the particular case of , due to (7), we have that
[TABLE]
meaning that , i.e., (15) holds for and the identity function . Furthermore, since is uniquely defined by the implicit equation (16) and is monotonically increasing with , this in turn implies
[TABLE]
Since and are arbitrary, and and which, together with (15), implies the desired result. ∎
Comparison with Capone et al. (2022)
In this section, we briefly examine how our approach compares to that of Capone et al. (2022) when used to compute uniform error bounds, i.e., percent credible intervals, for three different data sets. We carried out each experiment times. In the following, we report the rate of uniform error bound violation and the average length of the percent credible intervals. The method of Capone et al. (2022) is purely Bayesian and thus heavily dependent on the prior. The resulting credible intervals are well-calibrated, i.e., they cover most of the data. However, our approach is much better regarding sharpness. This is because Capone et al. (2022) is Bayesian and requires symmetric intervals, whereas our approach is frequentist and allows for asymmetric credible intervals. Our approach also exhibits a lower rate of uniform error bound violations than Capone et al. (2022) in most cases, which suggests that a frequentist approach is more adequate for computing uniform error bounds than a Bayesian one.
Bayesian Optimization
We now investigate how the proposed calibration approach can be employed in a Bayesian optimization context using two commonly used benchmark functions, the Ackley and Rosenbrock functions.
In Bayesian optimization, the goal is to find a point in input space that maximizes an unknown function . In particular, we investigate how our calibrated GP bound performs when used as an upper confidence bound (UCB) for a GP-UCB type acquisition function. Simply put, given a data set of size , the GP-UCB algorithm chooses a query point by maximizing the acquisition function
[TABLE]
where is a tuning parameter that stipulates the trade-off between exploration and exploitation, and may or may not depend on the data set . It has been shown that if the unknown function belongs to the RKHS associated with the kernel , and is chosen sufficiently large, then the GP-UCB achieves sublinear regret (Chowdhury & Gopalan, 2017). However, both assumptions typically cannot be verified in practice, and choosing both the kernel and the scaling factor in a principled manner remains an open problem. We propose employing the modified acquisition function
[TABLE]
where the hyperparameters are obtained via a calibrated model and a suitable choice of confidence parameter . In the experiments, we set and compute the calibrated hyperparameters by setting , meaning that we set expect only one percent of the evaluations to lie outside the confidence region. Note that even though the underlying function is fixed, it is reasonable to expect that some of the data lies outside the confidence region due to noise, and we can only expect the data to lie fully within the confidence region in the noiseless case, which we do not consider in this paper. Furthermore, we refrain from retraining the hyperparameters after each data point is collected, following the convention of other Bayesian optimization approaches (Srinivas et al., 2012; Chowdhury & Gopalan, 2017). While this does not enable us to employ the theoretical guarantees developed in Section 5, it reduces computational time significantly. We additionally compare our results to the vanilla UCB algorithm, where the hyperparameters, chosen via log-likelihood maximization, are identical for both the posterior mean and variance, and we set .
We evaluate the results both in terms of cumulative regret and simple regret. Cumulative regret after steps corresponds to the metric
[TABLE]
whereas simple regret is given by
[TABLE]
Typically, a Bayesian optimization algorithm is deemed useful if cumulative regret exhibits sublinear growth, implying that the average regret goes to zero. Simple regret, by contrast, corresponds to the best query among all past queries and is an important metric whenever evaluation costs are low (Berkenkamp et al., 2019).
In the case of the Ackley experiment, our approach typically chose lengthscales that were smaller than those computed via likelihood maximization. This results in more exhaustive exploration than vanilla UCB, which in turn means that local minima are explored more carefully before the focus of the optimization is shifted elsewhere. This results in better performance than when using vanilla UCB, both in terms of cumulative and simple regret. The results correspond to the top two figures in Figure 3.
In contrast to the Ackley experiment, in the Rosenbrock experiment our approach selects lengthscales that are larger than those suggested by the likelihood maximum. Roughly speaking, this means that the confidence intervals produced by the likelihood maximum hyperparameters are too conservative, and our approach attempts to compensate for this by indicating more confidence in the posterior mean obtained with the vanilla GP. This means that local minima are explored less meticulously than with the vanilla UCB algorithm. This choice is justified by the cumulative regret obtained with our approach, as it is slightly smaller than that obtained by the vanilla UCB algorithm. However, this also results in worse simple regret than the vanilla UCB algorithm, which is intuitive, as our approach opts to explore local minima less accurately than the vanilla UCB algorithm. We also note that both algorithms converge towards the same simple regret as the number of iterations increases. The results correspond to the bottom two figures in Figure 3.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Berkenkamp et al. (2019) Berkenkamp, F., Schoellig, A. P., and Krause, A. No-regret bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research , 20(50):1–24, 2019. URL http://jmlr.org/papers/v 20/18-213.html .
- 2Capone et al. (2022) Capone, A., Lederer, A., and Hirche, S. Gaussian process uniform error bounds with unknown hyperparameters for safety-critical applications. In Proceedings of the 39th International Conference on Machine Learning , volume 162 of Proceedings of Machine Learning Research , pp. 2609–2624. PMLR, 2022. URL https://proceedings.mlr.press/v 162/capone 22a.html .
- 3Chowdhury & Gopalan (2017) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning , volume 70 of Proceedings of Machine Learning Research , pp. 844–853, International Convention Centre, Sydney, Australia, 2017. PMLR. URL http://proceedings.mlr.press/v 70/chowdhury 17a.html .
- 4Deisenroth et al. (2015) Deisenroth, M. P., Fox, D., and Rasmussen, C. E. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence , 37(2):408–423, 2015.
- 5Deringer et al. (2021) Deringer, V. L., Bartók, A. P., Bernstein, N., Wilkins, D. M., Ceriotti, M., and Csányi, G. Gaussian process regression for materials and molecules. Chemical Reviews , 121(16):10073–10141, 2021. doi: 10.1021/acs.chemrev.1c 00022 . URL https://doi.org/10.1021/acs.chemrev.1c 00022 . PMID: 34398616. · doi ↗
- 6Fong & Holmes (2020) Fong, E. and Holmes, C. C. On the marginal likelihood and cross-validation. Biometrika , 107(2):489–496, 2020. doi: 10.1093/biomet/asz 077 . URL https://doi.org/10.1093/biomet/asz 077 . · doi ↗
- 7Gal et al. (2017) Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/84ddfb 34126 fc 3a 48ee 38d 7044 e 87276-Paper.pdf .
- 8Gneiting et al. (2007) Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 69(2):243–268, 2007.
