Maximizing Overall Diversity for Improved Uncertainty Estimates in Deep Ensembles
Siddhartha Jain, Ge Liu, Jonas Mueller, David Gifford

TL;DR
This paper introduces Maximize Overall Diversity (MOD), a simple method to enhance uncertainty estimates in neural network ensembles by increasing their diversity, leading to better out-of-distribution predictions and Bayesian optimization performance.
Contribution
MOD is a novel approach that encourages larger overall diversity in ensemble predictions, significantly improving uncertainty estimation without harming in-distribution accuracy.
Findings
Improves out-of-distribution prediction accuracy across multiple datasets.
Enhances Bayesian optimization performance using MOD-based uncertainty estimates.
Does not compromise in-distribution performance.
Abstract
The inaccuracy of neural network models on inputs that do not stem from the training data distribution is both problematic and at times unrecognized. Model uncertainty estimation can address this issue, where uncertainty estimates are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), a straightforward approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs that might be encountered in the future. When applied to various neural network ensembles, MOD significantly improves predictive performance for out-of-distribution test examples without sacrificing in-distribution performance on 38 Protein-DNA binding regression datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. Across many…
| Methods | Out-of-distribution NLL | # of TFs (MOD out-performance) | p-value | In-distribution NLL | # of TFs (MOD out-performance) | p-value |
|---|---|---|---|---|---|---|
| (OOD as sequences with top 10% binding affinity) | | | | | | |
| DeepEns | 0.7485 ± 0.124 | 26 | 1.7e-05 | -0.4266 ± 0.031 | 32 | 7.7e-09 |
| DeepEns+AT | 0.7438 ± 0.122 | 25 | 0.001 | -0.4312 ± 0.033 | 26 | 0.005 |
| NegCorr | 0.7358 ± 0.118 | 27 | 0.061 | -0.4314 ± 0.032 | 17 | 0.761 |
| MOD | 0.7153 ± 0.117 | | | -0.4312 ± 0.031 | | |
| MOD-R | 0.7225 ± 0.116 | 22 | 0.359 | -0.4325 ± 0.032 | 16 | 0.777 |
| MOD-in | 0.7326 ± 0.121 | 26 | 0.012 | -0.4317 ± 0.032 | 19 | 0.535 |
| (OOD as sequences with 80% GC content) | | | | | | |
| DeepEns | -0.6938 ± 0.052 | 20 | 0.022 | -0.5649 ± 0.029 | 34 | 3.1e-11 |
| DeepEns+AT | -0.7010 ± 0.041 | 23 | 0.007 | -0.5740 ± 0.027 | 21 | 0.292 |
| NegCorr | -0.6805 ± 0.065 | 25 | 0.011 | -0.5700 ± 0.026 | 25 | 0.017 |
| MOD | -0.7007 ± 0.047 | | | -0.5729 ± 0.027 | | |
| MOD-R | -0.6959 ± 0.040 | 24 | 0.004 | -0.5720 ± 0.027 | 22 | 0.357 |
| MOD-in | -0.6948 ± 0.054 | 21 | 0.103 | -0.5711 ± 0.028 | 22 | 0.163 |
| vs | DeepEns | DeepEns+AT | NegCorr |
|---|---|---|---|
| MOD-in | 21 (0.111) | 21 (0.041) | 19 (0.356) |
| MOD | 26 (0.003) | 24 (0.004) | 20 (0.001) |
| MOD-R | 22 (0.019) | 23 (0.007) | 22 (0.017) |

| vs | MOD-in | MOD | MOD-R |
|---|---|---|---|
| MOD-in | - | 17 (0.791) | 16 (0.51) |
| MOD | 19 (0.002) | - | 22 (0.173) |
| MOD-R | 20 (0.052) | 14 (0.674) | - |
| Datasets | DeepEns | DeepEns+AT | NegCorr | MOD | MOD-R | MOD-in | MOD-Adv |
|---|---|---|---|---|---|---|---|
| Out-of-distribution NLL | | | | | | | |
| concrete | -0.831 ± 0.237 | -0.915 ± 0.204 | -0.913 ± 0.277 | -0.904 ± 0.118 | -0.910 ± 0.193 | -0.924 ± 0.188 | -0.950 ± 0.200 |
| yacht | -1.597 ± 0.840 | -1.762 ± 0.647 | -1.972 ± 0.570 | -1.797 ± 0.437 | -1.761 ± 0.578 | -1.638 ± 0.663 | -1.948 ± 0.343 |
| naval-propulsion-plant | -2.580 ± 0.103 | -1.380 ± 0.087 | -2.618 ± 0.056 | -2.729 ± 0.071 | -2.130 ± 0.069 | -2.057 ± 0.055 | -2.629 ± 0.068 |
| wine-quality-red | 0.133 ± 0.132 | 0.115 ± 0.086 | 0.113 ± 0.104 | 0.153 ± 0.107 | 0.084 ± 0.072 | 0.085 ± 0.065 | 0.217 ± 0.114 |
| power-plant | -1.734 ± 0.054 | -1.731 ± 0.088 | -1.659 ± 0.075 | -1.638 ± 0.151 | -1.644 ± 0.120 | -1.731 ± 0.050 | -1.669 ± 0.066 |
| protein-tertiary-structure | 1.162 ± 0.231 | 1.178 ± 0.158 | 1.231 ± 0.130 | 1.197 ± 0.137 | 1.194 ± 0.214 | 1.154 ± 0.132 | 1.299 ± 0.252 |
| kin8nm | -1.980 ± 0.053 | -1.970 ± 0.093 | -2.036 ± 0.046 | -1.999 ± 0.049 | -2.003 ± 0.095 | -1.993 ± 0.078 | -2.027 ± 0.085 |
| bostonHousing | 1.591 ± 0.680 | 1.243 ± 0.690 | 1.821 ± 0.913 | 0.568 ± 0.959 | 0.460 ± 0.648 | 0.923 ± 0.733 | 1.517 ± 0.711 |
| energy | -1.590 ± 0.253 | -1.784 ± 0.153 | -1.718 ± 0.193 | -1.736 ± 0.117 | -1.741 ± 0.264 | -1.733 ± 0.199 | -1.772 ± 0.242 |
| MOD outperformance p-value | 0.002 | 4.9e-07 | 0.034 | - | 1.6e-04 | 4.6e-05 | 0.027 |
| In-distribution NLL | | | | | | | |
| concrete | -1.075 ± 0.094 | -1.129 ± 0.084 | -1.089 ± 0.102 | -1.155 ± 0.086 | -1.137 ± 0.132 | -1.090 ± 0.092 | -1.047 ± 0.177 |
| yacht | -3.286 ± 0.692 | -3.245 ± 0.822 | -3.570 ± 0.166 | -3.500 ± 0.190 | -3.461 ± 0.252 | -3.339 ± 0.815 | -3.556 ± 0.203 |
| naval-propulsion-plant | -2.735 ± 0.077 | -1.513 ± 0.042 | -2.810 ± 0.042 | -2.857 ± 0.067 | -2.297 ± 0.061 | -2.238 ± 0.047 | -2.817 ± 0.046 |
| wine-quality-red | -0.070 ± 0.853 | -0.341 ± 0.068 | -0.266 ± 0.291 | -0.337 ± 0.069 | -0.348 ± 0.045 | -0.351 ± 0.055 | -0.170 ± 0.505 |
| power-plant | -1.521 ± 0.015 | -1.525 ± 0.018 | -1.524 ± 0.023 | -1.523 ± 0.012 | -1.522 ± 0.017 | -1.524 ± 0.013 | -1.523 ± 0.016 |
| protein-tertiary-structure | -0.514 ± 0.013 | -0.519 ± 0.007 | -0.544 ± 0.012 | -0.533 ± 0.009 | -0.532 ± 0.012 | -0.529 ± 0.008 | -0.540 ± 0.012 |
| kin8nm | -1.305 ± 0.016 | -1.315 ± 0.020 | -1.334 ± 0.015 | -1.317 ± 0.019 | -1.315 ± 0.017 | -1.322 ± 0.015 | -1.314 ± 0.020 |
| bostonHousing | -0.901 ± 0.154 | -0.937 ± 0.144 | -0.656 ± 0.671 | -0.953 ± 0.147 | -0.883 ± 0.188 | -0.925 ± 0.180 | -0.728 ± 0.376 |
| energy | -2.426 ± 0.151 | -2.517 ± 0.098 | -2.620 ± 0.130 | -2.507 ± 0.153 | -2.525 ± 0.098 | -2.522 ± 0.129 | -2.638 ± 0.137 |
| MOD outperformance p-value | 2.4e-08 | 6.9e-11 | 0.116 | - | 8.9e-06 | 2.2e-06 | 0.046 |
| Methods | OOD NLL | In-Dist NLL |
|---|---|---|
| DeepEns | 1.3100 ± 0.2486 | -0.2193 ± 0.0207 |
| DeepEns+AT | 1.2348 ± 0.1291 | -0.2419 ± 0.0213 |
| NegCorr | 1.1731 ± 0.1978 | -0.2286 ± 0.0179 |
| MOD-in | 1.2625 ± 0.1961 | -0.2301 ± 0.0128 |
| MOD | 1.1294 ± 0.1707 | -0.2306 ± 0.0148 |
| MOD-R | 1.1847 ± 0.2442 | -0.2285 ± 0.0191 |
| MOD-Adv | 1.1547 ± 0.1865 | -0.2305 ± 0.0149 |
| Methods | Out-of-distribution RMSE | # of TFs (MOD out-performance) | p-value | In-distribution RMSE | # of TFs (MOD out-performance) | p-value |
|---|---|---|---|---|---|---|
| (OOD as sequences with top 10% binding affinity) | | | | | | |
| DeepEns | 0.2837 ± 0.011 | 26 | 1.3e-04 | 0.1591 ± 0.005 | 34 | 2.1e-12 |
| DeepEns+AT | 0.2812 ± 0.011 | 23 | 0.087 | 0.1582 ± 0.005 | 28 | 3.6e-05 |
| NegCorr | 0.2814 ± 0.010 | 20 | 0.124 | 0.1583 ± 0.005 | 20 | 0.492 |
| MOD | 0.2802 ± 0.010 | 0 | 0.0e+00 | 0.1581 ± 0.005 | 0 | 0.0e+00 |
| MOD-R | 0.2795 ± 0.010 | 17 | 0.933 | 0.1579 ± 0.005 | 13 | 0.969 |
| MOD-in | 0.2801 ± 0.010 | 16 | 0.617 | 0.1581 ± 0.005 | 18 | 0.633 |
| (OOD as sequences with 80% GC content) | | | | | | |
| DeepEns | 0.1190 ± 0.004 | 25 | 0.008 | 0.1415 ± 0.003 | 36 | 8.5e-24 |
| DeepEns+AT | 0.1180 ± 0.003 | 19 | 0.029 | 0.1394 ± 0.003 | 22 | 0.106 |
| NegCorr | 0.1179 ± 0.004 | 21 | 0.052 | 0.1403 ± 0.003 | 30 | 4.7e-08 |
| MOD | 0.1173 ± 0.003 | 0 | 0.0e+00 | 0.1394 ± 0.003 | 0 | 0.0e+00 |
| MOD-R | 0.1177 ± 0.003 | 22 | 0.112 | 0.1398 ± 0.003 | 19 | 0.079 |
| MOD-in | 0.1177 ± 0.004 | 21 | 0.401 | 0.1403 ± 0.003 | 27 | 2.8e-06 |
| Datasets | DeepEns | DeepEns+AT | NegCorr | MOD | MOD-R | MOD-in | MOD-Adv |
|---|---|---|---|---|---|---|---|
| Out-of-distribution RMSE | | | | | | | |
| concrete | 0.105 ± 0.016 | 0.102 ± 0.011 | 0.096 ± 0.012 | 0.105 ± 0.014 | 0.113 ± 0.028 | 0.099 ± 0.011 | 0.095 ± 0.012 |
| yacht | 0.038 ± 0.041 | 0.039 ± 0.044 | 0.018 ± 0.005 | 0.026 ± 0.005 | 0.026 ± 0.007 | 0.039 ± 0.045 | 0.021 ± 0.005 |
| naval-propulsion-plant | 0.111 ± 0.008 | 0.126 ± 0.015 | 0.112 ± 0.011 | 0.109 ± 0.008 | 0.125 ± 0.008 | 0.129 ± 0.007 | 0.122 ± 0.009 |
| wine-quality-red | 0.239 ± 0.013 | 0.236 ± 0.009 | 0.234 ± 0.008 | 0.239 ± 0.011 | 0.234 ± 0.010 | 0.235 ± 0.008 | 0.242 ± 0.014 |
| power-plant | 0.041 ± 0.002 | 0.042 ± 0.004 | 0.044 ± 0.004 | 0.048 ± 0.012 | 0.047 ± 0.010 | 0.041 ± 0.003 | 0.043 ± 0.003 |
| protein-tertiary-structure | 0.301 ± 0.009 | 0.301 ± 0.003 | 0.294 ± 0.003 | 0.299 ± 0.008 | 0.300 ± 0.011 | 0.300 ± 0.007 | 0.300 ± 0.004 |
| kin8nm | 0.042 ± 0.005 | 0.040 ± 0.005 | 0.037 ± 0.003 | 0.039 ± 0.003 | 0.041 ± 0.005 | 0.039 ± 0.004 | 0.039 ± 0.004 |
| bostonHousing | 0.221 ± 0.023 | 0.213 ± 0.021 | 0.210 ± 0.016 | 0.209 ± 0.017 | 0.200 ± 0.014 | 0.212 ± 0.016 | 0.217 ± 0.022 |
| energy | 0.060 ± 0.014 | 0.054 ± 0.012 | 0.053 ± 0.009 | 0.053 ± 0.005 | 0.054 ± 0.013 | 0.054 ± 0.009 | 0.047 ± 0.009 |
| MOD outperformance p-value | 0.028 | 0.102 | 0.989 | 0.0e+00 | 0.090 | 0.017 | 0.124 |
| In-distribution RMSE | | | | | | | |
| concrete | 0.086 ± 0.004 | 0.085 ± 0.004 | 0.084 ± 0.004 | 0.083 ± 0.003 | 0.083 ± 0.004 | 0.084 ± 0.004 | 0.084 ± 0.004 |
| yacht | 0.017 ± 0.019 | 0.019 ± 0.025 | 0.010 ± 0.002 | 0.012 ± 0.003 | 0.012 ± 0.004 | 0.018 ± 0.023 | 0.010 ± 0.002 |
| naval-propulsion-plant | 0.080 ± 0.003 | 0.089 ± 0.003 | 0.079 ± 0.004 | 0.077 ± 0.003 | 0.088 ± 0.003 | 0.091 ± 0.004 | 0.085 ± 0.004 |
| wine-quality-red | 0.170 ± 0.005 | 0.170 ± 0.005 | 0.169 ± 0.004 | 0.170 ± 0.003 | 0.169 ± 0.004 | 0.169 ± 0.005 | 0.170 ± 0.004 |
| power-plant | 0.053 ± 0.001 | 0.053 ± 0.001 | 0.052 ± 0.001 | 0.053 ± 0.001 | 0.053 ± 0.001 | 0.053 ± 0.001 | 0.052 ± 0.001 |
| protein-tertiary-structure | 0.164 ± 0.002 | 0.164 ± 0.001 | 0.161 ± 0.001 | 0.163 ± 0.000 | 0.163 ± 0.001 | 0.163 ± 0.001 | 0.164 ± 0.001 |
| kin8nm | 0.074 ± 0.002 | 0.072 ± 0.003 | 0.071 ± 0.002 | 0.072 ± 0.002 | 0.073 ± 0.001 | 0.072 ± 0.001 | 0.073 ± 0.002 |
| bostonHousing | 0.085 ± 0.008 | 0.084 ± 0.008 | 0.083 ± 0.007 | 0.084 ± 0.009 | 0.084 ± 0.008 | 0.084 ± 0.009 | 0.085 ± 0.009 |
| energy | 0.042 ± 0.004 | 0.039 ± 0.003 | 0.036 ± 0.004 | 0.039 ± 0.004 | 0.039 ± 0.003 | 0.039 ± 0.003 | 0.037 ± 0.004 |
| MOD outperformance p-value | 4.0e-06 | 5.9e-07 | 0.943 | 0.0e+00 | 8.9e-04 | 0.002 | 0.002 |
| Methods | OOD RMSE | In-Dist RMSE |
|---|---|---|
| DeepEns | 0.388 ± 0.021 | 0.196 ± 0.003 |
| DeepEns+AT | 0.378 ± 0.016 | 0.192 ± 0.004 |
| NegCorr | 0.376 ± 0.015 | 0.194 ± 0.002 |
| MOD-in | 0.382 ± 0.016 | 0.194 ± 0.002 |
| MOD | 0.375 ± 0.014 | 0.194 ± 0.002 |
| MOD-R | 0.377 ± 0.019 | 0.193 ± 0.002 |
| MOD-Adv | 0.374 ± 0.016 | 0.193 ± 0.002 |
Maximizing Overall Diversity for Improved Uncertainty Estimates
in Deep Ensembles
Siddhartha Jain,*1 Ge Liu,*1 Jonas Mueller,2 David Gifford,1
*The authors contribute equally, 1CSAIL,MIT, 2Amazon Web Services
[email protected], [email protected], [email protected], [email protected]
Abstract
The inaccuracy of neural network models on inputs that do not stem from the distribution underlying the training data is problematic and at times unrecognized. Uncertainty estimates of model predictions are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), an approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs. We apply MOD to regression tasks including 38 Protein-DNA binding datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. We also explore variants that utilize adversarial training techniques and data density estimation. For out-of-distribution test examples, MOD significantly improves predictive performance and uncertainty calibration without sacrificing performance on test data drawn from the same distribution as the training data. We also find that in Bayesian optimization tasks, the performance of UCB acquisition is improved via MOD uncertainty estimates.
Introduction
Model ensembling provides a simple, yet extremely effective technique for improving the predictive performance of arbitrary supervised learners each trained with empirical risk minimization (ERM) (?; ?). Often, ensembles are utilized not only to improve predictions on test examples stemming from the same underlying distribution as the training data, but also to provide estimates of model uncertainty when learners are presented with out-of-distribution (OOD) examples that may look different than the data encountered during training (?; ?). The widespread success of ensembles crucially relies on the variance-reduction produced by aggregating predictions that are statistically prone to different types of individual errors (?). Thus, prediction improvements are best realized by using a large ensemble with many base models, and a large ensemble is also typically employed to produce stable distributional estimates of model uncertainty (?; ?).
Practical applications of massive neural networks (NN) are commonly limited to small ensembles because of the unwieldy nature of these models (?; ?; ?). Although supervised learning performance may be enhanced by an ensemble comprised of only a few ERM-trained models, the resulting ensemble-based uncertainty estimates can exhibit excessive sampling variability in low-density regions of the underlying training distribution. Consider the example of an ensemble comprised of five models whose predictions just might agree at points far from the training data by chance. Figure 1 depicts an example of this phenomenon, which we refer to as uncertainty collapse, since the resulting ensemble-based uncertainty estimates would indicate these predictions are of high-confidence despite not being supported by any nearby training datapoints.
Unreliable uncertainty estimates are highly undesirable in applications where future input queries may not stem from the same distribution. A shift in input distribution can be caused by sampling bias, covariate shift, and the adaptive experimentation that occurs in bandits, Bayesian optimization (BO), and reinforcement learning (RL) contexts. Here, we propose Maximize Overall Diversity (MOD), a technique to stabilize OOD model uncertainty estimates produced by an ensemble of arbitrary neural networks. The core idea is to consider all possible inputs and encourage as much overall diversity in the corresponding model ensemble outputs as can be tolerated without diminishing the ensemble’s predictive performance. MOD utilizes an auxiliary loss function and data-augmentation strategy that is easily integrated into any existing training procedure.
Related Work
NN ensembles have been previously demonstrated to produce useful uncertainty estimates for sequential experimentation applications in Bayesian optimization and reinforcement learning (?; ?; ?). Proposed methods to improve ensembles include adversarial training to enforce smoothness (?), and maximizing ensemble output diversity over the training data (?). Recent work has proposed regularizers based on augmented out-of-distribution examples, but is primarily specific to classification tasks and non-trivially requires auxiliary generators of OOD examples (?) or existing examples from other classes (?). Another line of related work solely aims at producing better out-of-distribution detectors (?; ?; ?).
Our work seeks to improve uncertainty estimates in regression settings, where OOD data can stem from an arbitrary unknown distribution, and robust prediction on OOD data is desired rather than just detection of OOD examples. We propose a simple technique to regularize ensemble behavior over all possible inputs that does not require training an additional generator. Consideration of all possible inputs has previously been advocated by (?), although not in the context of uncertainty estimation. ? (?) propose a regularizer to ensure an ensemble approximates a valid Bayesian posterior, but their methodology is only applicable to homoskedastic noise, unlike ours. ? (?) also aim to control Bayesian NN output-behavior beyond the training distribution, but our methods do not require the Bayesian formulation they impose and can be applied to arbitrary NN ensembles, which are one of the most straightforward methods used for quantifying NN uncertainty (?; ?; ?). ? (?) focus on incorporating distributional uncertainty into uncertainty estimates via an additional prior distribution, whereas our focus is on improving model uncertainty in model ensembles.
Methods
We consider standard regression, assuming continuous target values are generated via $y = f(x) + \epsilon$ with $\epsilon \sim N(0, \sigma^2(x))$, such that the noise variance $\sigma^2(x)$ may heteroscedastically depend on feature values $x$. Given a limited training dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \sim P_{\text{in}}$, where $P_{\text{in}}$ specifies the underlying data distribution from which the in-distribution examples in the training data are sampled, our goal is to learn an ensemble of $M$ neural networks that accurately models both the underlying function $f(x)$ as well as the uncertainty in ensemble estimates of $f(x)$. Of particular concern are scenarios where test examples $x$ may stem from a different distribution $P_{\text{out}} \neq P_{\text{in}}$, which we refer to as out-of-distribution (OOD) examples. As in (?), each network $m$ (with parameters $\theta_m$) in our NN ensemble outputs both an estimated mean $\mu_m(x)$ to predict $f(x)$ and an estimated variance $\sigma_m^2(x)$ to predict $\sigma^2(x)$, and the per-network loss function $\mathcal{L}(\theta_m; x_n, y_n) = -\log p_{\theta_m}(y_n \mid x_n)$ is chosen as the negative log-likelihood (NLL) under the Gaussian assumption $y_n \sim N(\mu_m(x_n), \sigma_m^2(x_n))$. While traditional bagging provides different training data to each ensemble member, we simply train each NN using the entire dataset, since the randomness of separate NN-initializations and SGD-training suffices to produce comparable performance to bagging of NN models (?; ?; ?).
Following (?), we estimate $P(y \mid x)$ (and NLL with respect to the ensemble) by treating the aggregate ensemble output as a single Gaussian distribution $N(\bar{\mu}(x), \hat{\sigma}^2(x))$. Here, the ensemble-estimate of $f(x)$ is given by $\bar{\mu}(x)=\text{mean}\big(\{\mu_{m}(x)\}_{m=1}^{M}\big)$, and the uncertainty in the target value is given by $\hat{\sigma}^2(x) = \sigma_{\text{eps}}^{2}(x) + \sigma_{\text{mod}}^{2}(x)$ based on noise-level estimate $\sigma_{\text{eps}}^{2}(x)=\text{mean}\big(\{\sigma_{m}^{2}(x)\}_{m=1}^{M}\big)$ and model uncertainty estimate $\sigma_{\text{mod}}^{2}(x)=\text{variance}\big(\{\mu_{m}(x)\}_{m=1}^{M}\big)$. While we focus on Gaussian likelihoods for simplicity, our proposed methodology is applicable to general parametric conditional distributions.
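The aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the population (1/M) variance normalizer is an assumption, and the total predictive variance is assumed to be the sum of the noise-level and model-uncertainty estimates.

```python
import math
from statistics import fmean, pvariance


def aggregate_ensemble(means, variances):
    """Combine the per-network outputs (mu_m, sigma_m^2) at a single input x."""
    mu_bar = fmean(means)          # ensemble estimate of f(x)
    sigma2_eps = fmean(variances)  # noise-level estimate
    # Model uncertainty: spread of the member means.
    # Assumption: population variance (divide by M, not M - 1).
    sigma2_mod = pvariance(means)
    return mu_bar, sigma2_eps, sigma2_mod


def gaussian_nll(y, mu, sigma2):
    """Negative log-likelihood of y under N(mu, sigma2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)


def ensemble_nll(y, means, variances):
    """NLL of y when the ensemble is treated as one Gaussian."""
    mu_bar, sigma2_eps, sigma2_mod = aggregate_ensemble(means, variances)
    # Assumption: total variance is the sum of the two terms.
    return gaussian_nll(y, mu_bar, sigma2_eps + sigma2_mod)
```

If all members agree on the mean, `sigma2_mod` is zero and the ensemble reports only the noise-level estimate, which is the situation the text later calls uncertainty collapse when it happens far from the training data.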
Maximizing Overall Diversity (MOD)
Assuming $x$ and $y$ have been scaled to bounded regions, MOD encourages higher ensemble diversity by introducing an auxiliary loss that is computed over augmented data sampled from another distribution $Q$. Like $P_{\text{in}}$, $Q$ is also defined over the input feature space, but differs from the underlying training data distribution and instead describes OOD examples that could be encountered at test-time. The underlying population objective we target is
[TABLE]
with $\mathcal{L}$ as the original supervised learning loss function (e.g. NLL), and a user-specified penalty $\gamma > 0$. Since NLL entails a proper-scoring rule (?), minimizing the above objective with a sufficiently small value of $\gamma$ will ensure the ensemble seeks to recover $P(y \mid x)$ for inputs $x$ that lie in the support of the training distribution $P_{\text{in}}$ and otherwise output large model uncertainty for OOD $x$ that lie outside this support. As it is difficult in most applications to specify how future OOD examples may look, we aim to ensure our ensemble outputs high uncertainty estimates for any possible $P_{\text{out}}$ by taking the entire input space into consideration. To account for any possible OOD distribution, we simply pick $Q$ as the uniform distribution over $\mathcal{X}$, the bounded region of all possible inputs $x$. This choice is motivated by Theorem 1 below, which states that the uniform distribution most closely approximates all possible OOD distributions in the minimax sense.
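The displayed objective did not survive extraction in this copy. One form consistent with the surrounding description (a supervised loss over the training distribution, minus a penalty-weighted expected ensemble variance over an augmentation distribution) is sketched below. The symbols $P_{\text{in}}$, $Q$, and $\gamma$, and the $1/M$ averaging over ensemble members, are assumptions made for illustration and may differ from the notation of the original equation.

```latex
\min_{\theta_1,\dots,\theta_M}\;
\frac{1}{M}\sum_{m=1}^{M}
\mathbb{E}_{(x,y)\sim P_{\text{in}}}\big[\mathcal{L}(\theta_m; x, y)\big]
\;-\;
\gamma\,\mathbb{E}_{x\sim Q}\big[\sigma_{\text{mod}}^{2}(x)\big],
\qquad
\sigma_{\text{mod}}^{2}(x)=\text{variance}\big(\{\mu_{m}(x)\}_{m=1}^{M}\big)
```

Under this reading, setting $\gamma = 0$ recovers ordinary independent training of the ensemble members, and larger $\gamma$ trades in-distribution fit for disagreement among members on inputs drawn from $Q$.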
Theorem 1
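*The uniform distribution over $\mathcal{X}$ equals:
$\arg\min_{Q \in \mathcal{P}} \max_{P \in \mathcal{P}} \mathrm{KL}(P \,\|\, Q)$,
where for discrete $\mathcal{X}$, $\mathcal{P}$ denotes the set of all distributions, and for continuous $\mathcal{X}$, $\mathcal{P}$ is the set of all distributions with density functions that are bounded within some interval $[a, b]$.*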
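Proof For the discrete case with $|\mathcal{X}| = N$: let $Q$ have corresponding pmf $q_1, \dots, q_N$, so $\mathrm{KL}(P \,\|\, Q) = \sum_{k} p_k \log (p_k / q_k)$ for any $P$ with pmf $p_1, \dots, p_N$. When $Q$ is the uniform distribution, the worst case $P$ is one that puts all its mass on a single point $k$, which corresponds to $\mathrm{KL}(P \,\|\, Q) = \log N$. For any non-uniform $Q$: there exists $k$ where $q_k < 1/N$. Thus for $P$ which puts all its mass on $k$, we have $\mathrm{KL}(P \,\|\, Q) = -\log q_k > \log N$. The proof for the continuous case is similar.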
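The discrete case of the argument can be checked numerically. The sketch below assumes the divergence in the theorem is the Kullback-Leibler divergence KL(P || Q), which is how the proof reads; the function names are hypothetical. Because KL(P || Q) is convex in P, its maximum over all distributions on a finite set is attained at a point mass, so the worst case reduces to the largest value of -log q_k.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two pmfs on the same finite set (0 * log 0 treated as 0)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def worst_case_kl(q):
    """max over all P of KL(P || Q); attained by a point mass on the least likely point."""
    if min(q) <= 0.0:
        return math.inf  # a point mass where q is zero has infinite divergence
    return max(-math.log(qk) for qk in q)


def point_mass(index, size):
    """pmf that puts all of its mass on one point."""
    return [1.0 if i == index else 0.0 for i in range(size)]
```

For four points, the uniform pmf gives a worst case of log 4, while a skewed pmf such as `[0.4, 0.3, 0.2, 0.1]` gives -log 0.1, which is larger.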
In practice, we approximate $\mathbb{E}_{P_{\text{in}}}[\mathcal{L}(\theta_m; x, y)]$ using the average loss over the training data as in ERM, and train each $\theta_m$ with respect to its contribution to this term independently of the others as in bagging. To approximate $\mathbb{E}_{Q}[\sigma_{\text{mod}}^{2}(x)]$, we similarly utilize an empirical average based on augmented examples sampled uniformly throughout the feature space $\mathcal{X}$. Uniformly sampling from the input space takes constant time to compute. We expect only a marginal increase in training time, since the computation of back-propagation is largely parallelized and thus an increase in minibatch size would mainly increase memory consumption rather than computation time. The formal MOD procedure is detailed in Algorithm 1. We advocate selecting $\gamma$ as the largest value for which estimates of the in-distribution loss (on held-out validation data) do not indicate worse predictive performance. This strategy naturally favors smaller values of $\gamma$ as the sample size $N$ grows, thus resulting in lower model uncertainty estimates (with $\gamma \to 0$ as $N \to \infty$ when $P_{\text{in}}$ is supported everywhere and our NN are universal approximators).
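The empirical approximation described above can be sketched as a forward computation of the penalized loss on one minibatch. This is an illustration under stated assumptions, not the paper's Algorithm 1: each ensemble member is modeled as a plain callable returning `(mean, variance)`, inputs are assumed to be scaled to the unit hypercube, and the names `sample_uniform`, `mod_loss`, and `gamma` are hypothetical. No gradient step is shown.

```python
import math
import random
from statistics import fmean, pvariance


def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under N(mu, var)."""
    return 0.5 * math.log(2.0 * math.pi * var) + (y - mu) ** 2 / (2.0 * var)


def sample_uniform(n_points, dim, rng, low=0.0, high=1.0):
    """Draw augmented inputs uniformly from the bounded input region."""
    return [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_points)]


def mod_loss(ensemble, batch, augmented, gamma):
    """Average member NLL on the training batch minus gamma times the
    average variance of the member means on the augmented inputs."""
    nll = fmean([gaussian_nll(y, *net(x)) for net in ensemble for x, y in batch])
    diversity = fmean([pvariance([net(x)[0] for net in ensemble]) for x in augmented])
    return nll - gamma * diversity
```

A call such as `mod_loss(ensemble, batch, sample_uniform(len(batch), dim, random.Random(0)), gamma=1.0)` would evaluate the objective with as many augmented points as training points; that ratio is itself an assumption.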
We also experiment with an alternative choice of $Q$ being the uniform distribution over the finite training data (i.e. $Q(x) = 1/N$ for $x \in \{x_n\}_{n=1}^{N}$ and $Q(x) = 0$ otherwise). We call this alternative method MOD-in, and note its similarity to the diversity-encouraging penalty proposed by (?), which is also measured over the training data. Note that MOD in contrast considers $Q$ to be uniformly distributed over all possible test inputs rather than only the training examples. Maximizing diversity solely over the training data may fail to control ensemble behavior at OOD points that do not lie near any training example, and thus fail to prevent uncertainty collapse.
Maximizing Reweighted Diversity (MOD-R)
Aiming for high-quality OOD uncertainty estimates, we are mostly concerned with regularizing the ensemble-variance around points located in low density regions of the training data distribution. To obtain a simple estimate that intuitively reflects the inverse of the local density of at a particular set of feature values, one can compute the feature-distance to the nearest training data points (?). Under this perspective, we want to encourage greater model uncertainty for the lowest density points that lie furthest from the training data. Commonly used covariance kernels for Gaussian Process regressors (e.g. radial basis functions) explicitly enforce a high amount of uncertainty on points that lie far from the training data. As calculating the distance of each point to the entire training set may be undesirably inefficient for large datasets, we only compute the distance of our augmented data to a current minibatch during training. Specifically, we use these distances to compute the following:
[TABLE]
where are weights for each of the augmented points , and are members of the minibatch that are the nearest neighbors of . Throughout this paper, we use .
The weights are thus inversely related to a crude density estimate of the training distribution evaluated at each augmented sample. Rather than optimizing a loss that uniformly weights each augmented sample (as done in Algorithm 1), we can instead form a weighted loss computed over the minibatch of augmented samples, in which the ensemble variance at each augmented sample is scaled by its weight. This should increase the model uncertainty for augmented inputs proportionally to their distance from the training data. We call this variant of our methodology with augmented input reweighting MOD-R.
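The exact weight formula is not recoverable from this copy, so the following is only one plausible instantiation of the idea: each augmented point is weighted by its mean Euclidean distance to its k nearest neighbors in the current minibatch, and the weights are normalized to sum to one. The normalization, the use of Euclidean distance, and the names `knn_distance_weights` and `weighted_diversity` are all assumptions.

```python
import math


def knn_distance_weights(augmented, minibatch, k):
    """Weight each augmented point by its mean distance to its k nearest
    minibatch points, normalized so the weights sum to one."""
    raw = []
    for xa in augmented:
        dists = sorted(math.dist(xa, xb) for xb in minibatch)
        raw.append(sum(dists[:k]) / k)
    total = sum(raw)
    if total == 0.0:
        # Every augmented point coincides with a training point: fall back to uniform.
        return [1.0 / len(augmented)] * len(augmented)
    return [r / total for r in raw]


def weighted_diversity(weights, model_variances):
    """Weighted average of per-point ensemble variances (the reweighted penalty)."""
    return sum(w * v for w, v in zip(weights, model_variances))
```

With this choice, augmented points that land in dense parts of the minibatch contribute little to the penalty, while points far from every training example dominate it.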
Maximizing Overall Diversity with Adversarial Optimization (MOD-Adv)
We also consider another variant of MOD that utilizes adversarial training techniques. Here, we maximize the variance on relatively over-confident points in out-of-distribution regions, which are likely to comprise worst-case OOD examples. Specifically, we formulate a maximin optimization for the MOD penalty: the augmented points are chosen to minimize the ensemble variance, while the ensemble is trained to maximize the variance at those points. We call this variant MOD-Adv. In practice, we obtain the augmented points by taking a single gradient step in the direction of lower variance, starting from uniformly sampled points. The extra gradient step can double the computation time compared to MOD. The full algorithm is given in Algorithm 1. Note that MOD-Adv differs from traditional adversarial training in two respects: first, it takes a gradient step with regard to the model uncertainty measurement (the variance of the ensemble mean predictions) instead of with regard to the predicted score of another class; second, the adversarial step is taken starting from a uniformly sampled example instead of a training example. We apply MOD-Adv only to regression tasks with continuous features, since it is more natural to apply gradient descent on them.
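The single gradient step toward lower ensemble variance can be sketched as follows. This is a simplified illustration: the gradient is estimated with central finite differences instead of automatic differentiation, the result is clipped back into the input bounds, and the names `adversarial_step` and `step_size` are hypothetical. The clipping and the finite-difference gradient are assumptions, not details taken from the text.

```python
from statistics import pvariance


def ensemble_variance(mean_fns, x):
    """Variance of the member mean predictions at input x."""
    return pvariance([f(x) for f in mean_fns])


def numerical_grad(fn, x, h=1e-5):
    """Central finite-difference gradient of a scalar function of a list input."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((fn(up) - fn(dn)) / (2.0 * h))
    return grad


def adversarial_step(mean_fns, x, step_size, low=0.0, high=1.0):
    """Move x one gradient step toward LOWER ensemble variance, then clip."""
    grad = numerical_grad(lambda z: ensemble_variance(mean_fns, z), x)
    return [min(high, max(low, xi - step_size * gi)) for xi, gi in zip(x, grad)]
```

The returned point is where the ensemble is more confident than at the starting sample, so penalizing low variance there targets the over-confident regions described above.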
Experiments
Baseline Methods
Here, we evaluate various alternative strategies for improving model ensembles. All strategies are applied to the same base NN ensemble, which is taken to be the Deep Ensembles (DeepEns) model of (?) previously described in Methods.
Deep Ensembles with Adversarial Training (DeepEns+AT)
(?) used this strategy to improve their basic DeepEns model. The idea is to adversarially sample inputs that lie close to the training data but on which the NLL loss is high (assuming they share the same label as their neighboring training example). Then, we include these adversarial points as augmented data when training the ensemble, which smooths the function learned by the ensemble. Starting from training example $x$, we sample augmented datapoint $x' = x + \delta\,\text{sign}(\nabla_x \mathcal{L}(\theta; x, y))$, with the label for $x'$ assumed to be the same as that for the corresponding $x$. $\mathcal{L}$ here denotes the NLL loss function, and the values for hyperparameter $\delta$ that we search over include 0.05, 0.1, 0.2.
Negative Correlation (NegCorr)
This method from (?; ?) minimizes the empirical correlation between predictions of different ensemble members over the training data. It adds a penalty to the loss of the form $-\sum_{m}(\mu_m(x) - \bar{\mu}(x))^2$, where $\mu_m(x)$ is the prediction of the $m$th ensemble member and $\bar{\mu}(x)$ is the mean ensemble prediction. This penalty is weighted by a user-specified penalty $\gamma$, as done in our methodology.
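The penalty formula itself is missing from this copy. In the classical negative correlation learning formulation, which is assumed here, the penalty for member m is the product of its deviation from the ensemble mean with the summed deviations of all other members. Since the deviations sum to zero, this equals the negative squared deviation, which the hypothetical helper below makes concrete.

```python
from statistics import fmean


def negcorr_penalties(preds):
    """Classical negative-correlation penalty for each member at one input:
    p_m = (f_m - f_bar) * sum_{j != m} (f_j - f_bar)."""
    mean = fmean(preds)
    devs = [p - mean for p in preds]
    total = sum(devs)  # zero up to floating-point rounding
    return [d * (total - d) for d in devs]


def negcorr_loss_term(preds, penalty_weight):
    """Weighted sum of the member penalties; more spread gives a lower value."""
    return penalty_weight * sum(negcorr_penalties(preds))
```

Minimizing this term therefore pushes member predictions apart, but only at the training inputs where it is evaluated, which is the contrast the text draws with MOD.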
Experiment Details
All experiments were run on Nvidia TitanX 1080 Ti and Nvidia TitanX 2080 Ti GPUs with PyTorch version 1.0. Unless otherwise indicated, all p-values were computed using a one-tailed paired t-test per dataset, and the p-values are combined using Fisher's method to produce an overall p-value across all datasets in a task. All hyperparameters (including learning rate, L2-regularization, the penalty weight for MOD/Negative Correlation, and the adversarial training step size) were tuned based on validation set NLL. In every regression task, the search for the penalty weight was over the values 0.01, 0.1, 1, 5, 10, 20, 50. For MOD-Adv, we search for the adversarial step size over 0.2, 1.0, 3.0, 5.0 for UCI and 0.1, 0.5, 1 for the image data.
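Fisher's method, used above to combine per-dataset p-values, can be written without external libraries because the chi-square survival function has a closed form when the degrees of freedom are even. The sketch below is a generic implementation of that textbook procedure; the function names are hypothetical and the per-dataset p-values are assumed to be strictly positive and independent.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Survival function of a chi-square variable with an even number of
    degrees of freedom: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values: -2 * sum(log p) ~ chi-square with 2k dof."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_dof(statistic, 2 * len(p_values))
```

A single p-value is returned unchanged, and several moderately small p-values combine into a smaller overall value, which is why the overall p-values in the tables can be far below any individual dataset's.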
Univariate Regression
We first consider a one-dimensional regression toy dataset that is similar to the one used by (?). We generated training data from the function:
[TABLE]
[TABLE]
Here, the training data only contain samples drawn from two limited-size regions. Using the standard NLL loss as well as the auxiliary MOD penalty, we train a deep ensemble with 4 neural networks of identical architecture, each consisting of 1 hidden layer with 50 units, ReLU activation, two sigmoid outputs to estimate the mean and variance of the target, and L2 regularization. To depict the improvement gained by simply adding ensemble members, we also train an ensemble of 120 networks with the same architecture. Figure 1 shows the predictions and confidence intervals of the ensembles. MOD is able to produce more reliable uncertainty estimates on the left-hand regions that lack training data, whereas standard deep ensembles exhibit uncertainty collapse, even with many networks. MOD also properly inflates the predictive uncertainty in the center region where no training data is found. Using a smaller penalty weight in MOD ensures the ensemble predictive performance remains strong for in-distribution inputs that lie near the training data and the ensemble exhibits adequate levels of certainty around these points. While the larger penalty value leads to overly conservative uncertainty estimates that are large everywhere, we note the mean of the ensemble predictions remains highly accurate for in-distribution inputs.
Protein Binding Microarray Data
We next study scientific data with discrete features by predicting Protein-DNA binding. This is a collection of 38 different microarray datasets, each of which contains measurements of the binding affinity of a single transcription factor (TF) protein against all possible 8-base DNA sequences (?). We consider each dataset as a separate task with $y$ taken to be the binding affinity scaled to the interval [0,1] and $x$ the one-hot embedded DNA sequence. As we ignore reverse-complements, there are $(4^8 + 4^4)/2 = 32{,}896$ possible values of $x$.
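The size of this input space can be checked by direct enumeration. The sketch below treats a sequence and its reverse complement as the same input by keeping the lexicographically smaller of the two, and shows one common one-hot layout. The helper names and the A, C, G, T column order are assumptions for illustration, and the count is a sanity check derived from the setup described above, not a number quoted from the dataset.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_kmers(k):
    """All length-k DNA sequences, with reverse-complement pairs merged."""
    seen = set()
    for bases in product("ACGT", repeat=k):
        seq = "".join(bases)
        seen.add(min(seq, reverse_complement(seq)))
    return seen


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-dimensional indicator rows."""
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]
```

For k = 8 there are 4^8 = 65,536 raw sequences, of which 4^4 = 256 are their own reverse complement, so merging pairs leaves (65,536 + 256) / 2 = 32,896 distinct inputs.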
Regression
We trained a small ensemble of 4 neural networks with the same architecture as in the previous experiments. We consider 2 different OOD test sets, one comprised of the sequences with the top 10% of $y$-values and the other comprised of the sequences in which more than 80% of the positions in $x$ are G or C (GC-content). For each OOD set, we use the remainder of the sequences as the corresponding in-distribution set. We separate them into an extremely small training set (300 examples) and a validation set (300 examples), and use the rest as the in-distribution test set. We compare MOD along with 3 alternative sampling distributions (MOD-in, MOD-R, and MOD-Adv) against the 3 baselines previously mentioned. We search over 0, 1e-3, 0.01, 0.05, 0.1 for the penalty and 0.01 for the learning rate.
Table 1 and Appendix Table 1 show mean OOD and in-distribution performance across 38 TFs (averaged over 10 runs using random data splits and NN initializations). MOD methods significantly improve performance on all metrics and OOD setups compared to DeepEns/DeepEns+AT, both in terms of the number of TFs on which they outperform and the overall p-value, and are on par with DeepEns+AT on in-distribution data. The re-weighting scheme (MOD-R) further improves performance in the top 10% $y$-value OOD setup. Figure 2 shows the calibration curves on two of the TFs where the deep ensembles are over-confident on top 10% $y$-value OOD examples. MOD-R and MOD improve the calibration results by a significant margin compared to most of the baselines.
Bayesian Optimization
Next, we compared how the MOD, MOD-R, and MOD-in ensembles performed against the DeepEns, DeepEns+AT, and NegCorr ensembles in 38 Bayesian optimization tasks using the same protein binding data (?). For each TF, we performed 30 rounds of DNA-sequence acquisition, acquiring batches of 10 sequences per round in an attempt to maximize binding affinity. We used the upper confidence bound (UCB) as our acquisition function (?), ordering the candidate points via (with UCB coefficient ).
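Batch selection with an upper confidence bound can be sketched as below. The acquisition formula and the coefficient value are missing from this copy, so the score used here, ensemble mean plus a coefficient times the predictive standard deviation, is the standard UCB form and an assumption about the paper's exact choice. Which variance enters the score (model uncertainty alone or the total) is also left to the caller, and the names `ucb_scores`, `select_batch`, and `beta` are hypothetical.

```python
import math


def ucb_scores(means, variances, beta):
    """Upper confidence bound for each candidate: mean + beta * std."""
    return [m + beta * math.sqrt(v) for m, v in zip(means, variances)]


def select_batch(candidates, means, variances, beta, batch_size):
    """Return the batch_size candidates with the highest UCB score."""
    scores = ucb_scores(means, variances, beta)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:batch_size]]
```

With `beta = 0` the rule is purely greedy on the predicted mean. If the ensemble's variance collapses on unexplored sequences, every positive `beta` behaves the same way, which is why better OOD uncertainty can change which sequences get acquired.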
At every acquisition iteration, we randomly held out 10% of the training set as the validation set and chose the penalty (for MOD, MOD-in, MOD-R, and NegCorr) that produced the best validation NLL (out of choices: 0, 5, 10, 20, 40, 80). The stopping epoch is chosen based on the validation NLL not increasing for 10 epochs with an upper limit of 30 epochs. Optimization was done with a learning rate of 0.01, an L2 penalty of 0.01, and the Adam optimizer. For each of the 38 TFs, we performed 20 Bayesian optimization runs with different seed sequences (same seeds used for all the methods), using 200 points randomly sampled from the bottom 90% of $y$ values as the initial training set.
We evaluated on the metric of simple regret $r_T = \max_{x \in \mathcal{X}} f(x) - \max_{t \le T} f(x_t)$ (the second term in the subtraction quantifies the best point acquired so far and the first term is the global best). The results are presented in Table 2. MOD outperforms all other methods in both the number of TFs with better regret and the combined p-value. MOD-R is also strong, outperforming all other methods except MOD, to which it is roughly equivalent in terms of statistical significance. Figure 3 shows regret for the TFs OVOL2 and HESX1, tasks in which MOD and MOD-R outperform the other methods.
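Simple regret, as described above, is the gap between the global best value and the best value acquired so far. A minimal sketch that tracks it across acquisition rounds follows; the function names are hypothetical and each round is assumed to be a non-empty list of measured values for that round's batch.

```python
def simple_regret(global_best, acquired_values):
    """Global best value minus the best value acquired so far."""
    return global_best - max(acquired_values)


def regret_curve(global_best, rounds):
    """Simple regret after each acquisition round (a list of batches of values)."""
    best_so_far = float("-inf")
    curve = []
    for batch in rounds:
        best_so_far = max(best_so_far, max(batch))
        curve.append(global_best - best_so_far)
    return curve
```

The curve is non-increasing by construction and reaches zero once the global optimum has been acquired.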
UCI Regression Datasets
We next experimented with 9 real world datasets with continuous inputs in some applicable bounded domain. We follow the experimental setup that (?) and (?) used to evaluate deep ensembles and deep Bayesian regressors. We split off all datapoints whose -values fall in the top as an OOD test set (so datapoints with such large -values are never encountered during training). We simulate the situation where training set is limited and thus used of the data for training and for validation. The remaining data is used as an in-distribution test set. The analysis is repeated for 10 random splits of the data to ensure robustness. We again use an ensemble of 4 fully-connected neural networks with the same architecture as above and the NLL training loss searching over hyperparameter values: L2 penalty , learning rate . We report the negative log-likelihood (NLL) on both in- and out-of-distribution test sets for ensembles trained via different strategies (including MOD-Adv) and examine the calibration curves.
As shown in Table 3, MOD outperforms DeepEns on 6 of the 9 datasets in OOD NLL and has a significant overall p-value compared to all baselines. MOD-Adv ranks first in OOD NLL in terms of average rank across all datasets, showing better robustness than MOD. The MOD loss leads to higher-quality uncertainties on OOD data while also improving the in-distribution performance of DeepEns.
Figure 2 shows the calibration curves on two of the datasets where the basic deep ensembles exhibit over-confidence on OOD data. Note that retaining accurate calibration on OOD data is extremely difficult for most machine learning methods. MOD and MOD-R improve calibration by a significant margin compared to most of the baselines, validating the effectiveness of our MOD procedure.
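A calibration curve for regression can be computed along the following lines. This sketch assumes Gaussian predictive distributions and compares each nominal confidence level with the observed fraction of targets that fall inside the corresponding central interval; the function name and the exact construction used for Figure 2 are assumptions.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def calibration_curve(
    y: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    levels: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 0.9),
) -> List[Tuple[float, float]]:
    """Return (nominal level, observed coverage) pairs for central intervals."""
    std_normal = NormalDist()
    curve = []
    for level in levels:
        # Half-width of the central interval, in units of the predicted std.
        z = std_normal.inv_cdf(0.5 + level / 2.0)
        inside = sum(abs(t - m) <= z * s for t, m, s in zip(y, means, stds))
        curve.append((level, inside / len(y)))
    return curve
```

A perfectly calibrated model gives observed coverage equal to the nominal level; observed coverage below the nominal level corresponds to the over-confidence discussed above.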
The selection of the penalty is critical for MOD, so we also examine the effect of this choice on in-distribution performance for the 9 UCI and 38 TF binding regression tasks. As shown in Figure 4, the penalty generally does not affect or hurt in-distribution NLL until it gets too large, at which point it fairly consistently starts hurting it. When the penalty is selected properly, it may even improve in-distribution performance slightly, as shown in the previous tables.
Age Prediction from Images
To demonstrate the effectiveness of MOD, MOD-Adv, and MOD-R on high-dimensional data, we consider supervised learning with image data. Here, we use a dataset of human images collected from IMDB and Wikimedia and annotated with age and gender information (?). The IMDB and Wiki parts of the dataset consist of 460K+ and 62K+ images, respectively. In the Wiki dataset, 28,601 images are of males and the rest are of females.
For the Wiki images, we predicted a person's age from their image, using 2000 images of males as the training set. We hold out the oldest 10% of the people as the OOD set. We used the Wide Residual Network architecture (?) with a depth of 4 and a width factor of 2. As before, we used an ensemble of size 4. The search for the optimal value was over . The stopping epoch is chosen based on the validation NLL not improving for 10 epochs, with an upper limit of 30 epochs. Optimization used the Adam optimizer with a learning rate of 0.001 and an L2 penalty of 0.001. The NLL results are in Table 4, whereas the RMSE results are in the Appendix. Both MOD and MOD-Adv get the best results on OOD NLL, with the improvement being statistically significant over the other methods. MOD gets an NLL of 1.129 on OOD data, MOD-Adv gets 1.155, and MOD-R gets 1.185, in contrast to DeepEns, which gets only 1.304. Thus both MOD and MOD-R show significant improvements in NLL on the OOD data. In addition, while DeepEns+AT has a better mean in-distribution NLL than MOD, the focus of this paper is out-of-distribution uncertainty, on which MOD and MOD-Adv perform very well. Notably, every MOD variant improves performance both in and out of distribution. Thus augmenting the loss function with the MOD penalty should not make your model worse.
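The hold-out of the oldest 10% as an OOD set can be sketched as below. The same kind of split by target value underlies the other OOD experiments; the function name, the rounding of the OOD set size, and the handling of ties are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_top_fraction(
    targets: Sequence[float], ood_fraction: float = 0.10
) -> Tuple[List[int], List[int]]:
    """Return (in-distribution indices, OOD indices), OOD being the largest targets."""
    n_ood = int(round(ood_fraction * len(targets)))
    # Indices sorted by ascending target; ties keep their original order.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    split = len(targets) - n_ood
    return sorted(order[:split]), sorted(order[split:])


# Toy ages: with 10 people, the single oldest person forms the OOD set.
ages = [30, 25, 80, 41, 52, 19, 67, 33, 45, 58]
in_idx, ood_idx = split_top_fraction(ages)
print(ood_idx)  # [2]
```

Training and validation data would then be drawn only from the in-distribution indices, so that targets as large as those in the OOD set are never seen during training.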
Conclusion
We have introduced a loss function and data augmentation strategy that helps stabilize distribution uncertainty estimates obtained from model ensembling. Our method increases model uncertainty over the entire input space while simultaneously maintaining predictive performance, which helps mitigate the uncertainty collapse that may arise in small model ensembles. We further proposed two variants of our method. MOD-R assesses the distance of an augmented sample from the training distribution and aims to ensure higher model uncertainty in low-density regions, and MOD-Adv uses adversarial optimization to improve model uncertainty in relatively over-confident regions more efficiently. Our methods improve both in- and out-of-distribution NLL, out-of-distribution RMSE, and calibration on a variety of datasets drawn from biology, vision, and common UCI benchmarks. We also showed that MOD is useful in hard Bayesian optimization tasks. Future work could develop techniques to generate OOD augmented samples for structured data, and apply ensembles with improved uncertainty-awareness to currently challenging tasks such as exploration in reinforcement learning.
Appendix
References
- [Balan et al. 2015] Balan, A. K.; Rathod, V.; Murphy, K. P.; and Welling, M. 2015. Bayesian dark knowledge. In Advances in Neural Information Processing Systems.
- [Barrera et al. 2016] Barrera, L. A.; Vedenko, A.; Kurland, J. V.; Rogers, J. M.; Gisselbrecht, S. S.; Rossin, E. J.; Woodard, J.; Mariani, L.; Kock, K. H.; Inukai, S.; et al. 2016. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science 351(6280):1450–1454.
- [Beluch et al. 2018] Beluch, W. H.; Genewein, T.; Nürnberger, A.; and Köhler, J. M. 2018. The power of ensembles for active learning in image classification. In IEEE Conference on Computer Vision and Pattern Recognition.
- [Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
- [Breiman 1996] Breiman, L. 1996. Bagging predictors. Machine Learning 24:123–140.
- [Brown 2004] Brown, G. 2004. Diversity in neural network ensembles. Ph.D. Dissertation, University of Birmingham.
- [Chen et al. 2017] Chen, R. Y.; Sidor, S.; Abbeel, P.; and Schulman, J. 2017. UCB exploration via Q-ensembles. arXiv:1706.01502.
- [Choi and Jang 2018] Choi, H., and Jang, E. 2018. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392.
