Robustness Against Outliers For Deep Neural Networks By Gradient Conjugate Priors
Pavel Gurevich, Hannes Stuke

TL;DR
This paper introduces a gradient conjugate prior (GCP) network that robustly estimates probability distributions in the presence of outliers, providing explicit bias correction formulas and demonstrating superior performance over existing methods.
Contribution
The paper develops a novel GCP network with explicit bias correction formulas for outlier-affected data, improving robustness in distribution reconstruction.
Findings
GCP network effectively corrects bias caused by outliers.
Fitted mean is close to ground truth within an exponential neighborhood.
Corrected variance remains close to true variance, even with high outlier proportion.
| Boston | | |
|---|---|---|
| Outliers: 0% | RMSE | AUC |
| Beta | 3.59 ± 1.51 | 2.14 ± 0.49 |
| Gamma | 3.64 ± 1.52 | 2.21 ± 0.55 |
| BetaBayes | 3.69 ± 1.52 | 2.53 ± 0.79 |
| GCPSt | 3.62 ± 1.60 | 1.92 ± 0.42 |
| GCP | 3.62 ± 1.60 | 1.91 ± 0.41 |
| EnsBeta | 3.71 ± 1.60 | 2.18 ± 0.58 |
| EnsGamma | 3.75 ± 1.65 | 2.35 ± 0.67 |
| EnsGCP | 3.67 ± 1.61 | 1.73 ± 0.42 |
| Outliers: 5% | RMSE | AUC |
| Beta | 3.42 ± 1.37 | 2.19 ± 0.51 |
| Gamma | 3.54 ± 1.46 | 2.22 ± 0.51 |
| BetaBayes | 3.76 ± 1.56 | 2.58 ± 0.81 |
| GCPSt | 3.57 ± 1.47 | 2.55 ± 1.11 |
| GCP | 3.57 ± 1.47 | 2.05 ± 0.48 |
| EnsBeta | 3.53 ± 1.48 | 2.19 ± 0.51 |
| EnsGamma | 3.59 ± 1.54 | 2.24 ± 0.57 |
| EnsGCP | 3.61 ± 1.52 | 1.85 ± 0.47 |
| Outliers: 10% | RMSE | AUC |
| Beta | 3.31 ± 1.26 | 2.21 ± 0.49 |
| Gamma | 3.49 ± 1.42 | 2.28 ± 0.52 |
| BetaBayes | 3.79 ± 1.63 | 2.73 ± 1.06 |
| GCPSt | 3.63 ± 1.52 | 2.54 ± 1.08 |
| GCP | 3.63 ± 1.52 | 2.10 ± 0.52 |
| EnsBeta | 3.49 ± 1.46 | 2.18 ± 0.53 |
| EnsGamma | 3.55 ± 1.52 | 2.22 ± 0.52 |
| EnsGCP | 3.66 ± 1.52 | 1.90 ± 0.52 |
| Outliers: 15% | RMSE | AUC |
| Beta | 3.32 ± 1.24 | 2.33 ± 0.51 |
| Gamma | 3.42 ± 1.31 | 2.21 ± 0.53 |
| BetaBayes | 3.84 ± 1.62 | 2.64 ± 0.94 |
| GCPSt | 3.57 ± 1.42 | 2.33 ± 0.99 |
| GCP | 3.57 ± 1.42 | 2.18 ± 0.67 |
| EnsBeta | 3.45 ± 1.42 | 2.18 ± 0.47 |
| EnsGamma | 3.51 ± 1.44 | 2.19 ± 0.47 |
| EnsGCP | 3.70 ± 1.51 | 2.03 ± 0.64 |
| Outliers: 20% | RMSE | AUC |
| Beta | 3.49 ± 1.32 | 2.69 ± 0.72 |
| Gamma | 3.45 ± 1.36 | 2.33 ± 0.58 |
| BetaBayes | 3.84 ± 1.58 | 2.60 ± 1.06 |
| GCPSt | 3.68 ± 1.52 | 2.49 ± 1.02 |
| GCP | 3.68 ± 1.52 | 2.37 ± 0.85 |
| EnsBeta | 3.43 ± 1.38 | 2.29 ± 0.52 |
| EnsGamma | 3.51 ± 1.44 | 2.23 ± 0.50 |
| EnsGCP | 3.69 ± 1.44 | 2.06 ± 0.54 |
| Boston | | | |
|---|---|---|---|
| | LR | D | NE |
| Beta, Gamma | 0.00002 | 0.4 | 2500 |
| GCP | 0.0001 | 0.3 | 700 |
| Boston | |
|---|---|
| 0.2 | |
| 0.4 |
| Boston | |
|---|---|
| 0.1 | |
| 1 |
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Machine Learning and Algorithms · Gaussian Processes and Bayesian Inference
Robustness Against Outliers For Deep Neural Networks
By Gradient Conjugate Priors
Pavel Gurevich
Institute of Mathematics,
Free University Berlin
14195 Berlin, Germany
Hannes Stuke
Institute of Mathematics,
Free University Berlin
14195 Berlin, Germany
Peoples’ Friendship University of Russia. Equal contribution.
Abstract
We analyze a new robust method for the reconstruction of probability distributions of observed data in the presence of output outliers. It is based on a so-called gradient conjugate prior (GCP) network which outputs the parameters of a prior. By rigorously studying the dynamics of the GCP learning process, we derive an explicit formula for correcting the obtained variance of the marginal distribution and removing the bias caused by outliers in the training set. Assuming a Gaussian (input-dependent) ground truth distribution contaminated with a proportion $\varepsilon$ of outliers, we show that the fitted mean is in a $c e^{-1/\varepsilon}$-neighborhood of the ground truth mean and the corrected variance is in a $b\varepsilon$-neighborhood of the ground truth variance, whereas the uncorrected variance of the marginal distribution can even be infinite. We explicitly find $b$ as a function of the output of the GCP network, without a priori knowledge of the outliers (possibly input-dependent) distribution. Experiments with synthetic and real-world data sets indicate that the GCP network fitted with a standard optimizer outperforms other robust methods for regression.
1 Introduction
Development of methods robust against outliers in the observed data is an important direction of machine learning and statistics [17]. One distinguishes between input outliers (i.e., outliers in the input space) and output outliers (i.e., wrongly labeled samples). The former can potentially be detected both while fitting neural networks and when predicting labels of new data samples. The latter are visible at the fitting stage only and significantly distort the approximate distribution one uses for predictions afterwards. Bayesian neural networks and ensemble methods can naturally detect input outliers at the prediction stage by assigning high uncertainty to them [27, 20]. In order to deal with input outliers at the fitting stage, one can use covariate shift importance sampling [31, 37], which assumes knowledge of the training and test distributions of the input variable and downweights the samples with a small ratio of the test density to the training density.
We concentrate on how to mitigate the influence of output outliers at the fitting stage. We will estimate the unknown mean and variance of the labels $\mathbf{y}$ (ground truth distribution $p_{\mathrm{gt}}(y \mid x)$) in spite of contamination by an outliers distribution $p_{\mathrm{out}}(y \mid x)$. [Footnote 1: We denote random variables by bold letters and the arguments of their probability distributions by the corresponding non-bold letters.] More specifically, we assume that the labels in the training set have Huber’s contaminated distribution [16]
$$p_{\mathrm{train}}(y \mid x) = (1-\varepsilon)\, p_{\mathrm{gt}}(y \mid x) + \varepsilon\, p_{\mathrm{out}}(y \mid x), \tag{1}$$
where $\varepsilon$ represents the proportion of outliers. Henceforth, we omit conditioning on $x$ for notational ease. We assume throughout that the ground truth distribution $p_{\mathrm{gt}}$ is univariate Gaussian with mean $\mu_{\mathrm{gt}}$ and variance $V_{\mathrm{gt}}$, and we denote by $\mu_{\mathrm{out}}$ and $V_{\mathrm{out}}$ the mean and variance of the outliers distribution $p_{\mathrm{out}}$. We do not impose restrictions on $p_{\mathrm{out}}$ except for a certain polynomial decay at infinity; see the technical assumptions in Sec. 3 and the supplement (Appendix B).
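The contamination model above can be illustrated with a minimal sampling sketch. The function name `sample_contaminated`, the uniform outlier law, and all numerical values are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def sample_contaminated(n, eps, mu_gt, sigma_gt, sample_outliers, rng):
    """Draw n labels from (1 - eps) * N(mu_gt, sigma_gt^2) + eps * p_out."""
    is_outlier = rng.random(n) < eps             # Bernoulli(eps) mask
    y = rng.normal(mu_gt, sigma_gt, size=n)      # ground truth draws
    # Overwrite the masked entries with draws from the outliers distribution.
    y[is_outlier] = sample_outliers(int(is_outlier.sum()), rng)
    return y, is_outlier

rng = np.random.default_rng(0)
# Hypothetical outliers distribution: uniform on [-10, 10].
y, mask = sample_contaminated(
    10_000, 0.05, 0.0, 1.0, lambda k, r: r.uniform(-10.0, 10.0, size=k), rng
)
```

The empirical variance of `y` is dominated by the few outliers, while `y[~mask]` keeps the ground truth variance; this is the gap that a robust estimate has to close without access to `mask`.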
The main contributions of this paper are as follows.
1. We prove that outliers cause a qualitative change in the structure of the energy surfaces of the GCP network (analyzed in [11] in the absence of outliers). Namely, outliers make a global minimum bifurcate from infinity to a finite value (Theorem 3.1). In turn, this changes the predictive distribution from Gaussian into Student’s t, whose variance may be significantly larger than the ground truth variance.
2. We show how the knowledge of the above finite equilibrium allows one to reconstruct the ground truth mean and variance (Theorems 4.1 and 4.2).
Our experiments in Sec. 5 with synthetic and real-world data sets indicate that the GCP network, fitted with a standard optimizer (Adam in our case), outperforms other robust methods, particularly by properly estimating the ground truth variance.
1.1 Main idea
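For each $x$ in the input space, we define a probabilistic model for a random variable $\mathbf{y}$ and latent variables $\boldsymbol{\mu}, \boldsymbol{\tau}$:
$$p(y, \mu, \tau \mid x) = p(y \mid \mu, \tau)\, p(\mu, \tau \mid m, \alpha, \beta, \nu), \tag{2}$$
where the likelihood $p(y \mid \mu, \tau)$ is assumed Gaussian with mean $\mu$ and precision $\tau$, while the latter are treated as random variables with a normal-gamma distribution $p(\mu, \tau \mid m, \alpha, \beta, \nu)$. The parameters $m, \alpha, \beta, \nu$ are functions of the input $x$ and are represented as outputs of multi-layer neural networks (Fig. 1, left).
The marginal likelihood is then a Student’s t-distribution
$$p(y \mid x) = \iint p(y \mid \mu, \tau)\, p(\mu, \tau \mid m, \alpha, \beta, \nu)\, d\mu\, d\tau = t_{2\alpha}\!\left(y \,\Big|\, m, \frac{\beta(\nu+1)}{\nu\alpha}\right), \tag{3}$$
that is, a Student’s t-distribution with $2\alpha$ degrees of freedom, location $m$, and squared scale $\beta(\nu+1)/(\nu\alpha)$.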
In the standard Bayesian approach and $x$-independent case, one updates $m, \alpha, \beta, \nu$ based on observations $y_1, \dots, y_N$. However, this is not possible in the neural networks framework because, on the one hand, different $y_n$ belong to different input points $x_n$ and, on the other hand, one cannot update the outputs of a neural network directly. The theory of Bayesian neural networks suggests treating the weights of neural networks as random variables with a certain prior and approximating their (usually analytically intractable) posterior [25, 36, 3, 19, 8, 14, 23, 22]. Instead, we follow the gradient conjugate prior (GCP) method proposed in [11]. We treat the weights of the neural networks as deterministic parameters. Given an observation $y$ corresponding to an input $x$, one can explicitly find the parameters of the posterior distribution of $\boldsymbol{\mu}, \boldsymbol{\tau}$. We perform a gradient descent step towards minimization of the Kullback–Leibler (KL) divergence from the posterior to the prior, where the gradient is taken with respect to the weights of the neural networks representing $m, \alpha, \beta, \nu$. It is shown in [11] that the GCP update is equivalent to maximizing the marginal log-likelihood. Furthermore, the above update of the weights induces an update of $m, \alpha, \beta, \nu$, which allows one to write a dynamical system (in the limit as the learning rate goes to $0$) for the evolution of $m, \alpha, \beta, \nu$ for each input $x$. This dynamical system takes the form
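$$\dot{m} = -\mathbb{E}\!\left[\frac{\partial K}{\partial m}\right], \quad \dot{\alpha} = -\mathbb{E}\!\left[\frac{\partial K}{\partial \alpha}\right], \quad \dot{\beta} = -\mathbb{E}\!\left[\frac{\partial K}{\partial \beta}\right], \quad \dot{\nu} = -\mathbb{E}\!\left[\frac{\partial K}{\partial \nu}\right], \tag{5}$$
where $K$ is the above KL-divergence, the dot stands for the derivative with respect to fictitious time, and the expectations are taken with respect to the contaminated distribution in (1); see details in Sec. 2.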
By analyzing system (5), we show that, for small , the parameters (see (4)) converge to a finite equilibrium. We denote it by again (slightly abusing notation) and set
[TABLE]
We call these quantities the prognostic mean and variance. [Footnote 2: As opposed to the predictive variance in (9) of the marginal Student’s t-distribution in (3).] Here is monotone increasing from [math] to and satisfies for all , see Fig. 4 in the supplement. It is defined as the unique root of the equation
[TABLE]
Due to Lemma C.2 in the supplement, is well defined for all . We show that, for small , the prognostic mean is exponentially close to (Theorem 4.1), while the prognostic variance is linearly close to (Theorem 4.2), namely,
[TABLE]
where and is defined in (56) in the supplement. We emphasize the novelty of the prognostic variance in (6), which provides a correction of the usually used variance of the marginal distribution (3). In our case, the latter is Student’s t variance
$$V_{\mathrm{St}} = \frac{\beta(\nu+1)}{(\alpha-1)\nu} \quad (\alpha > 1). \tag{9}$$
Note that ; moreover, if . Therefore, even though Student’s t-distribution is a popular choice in robust statistics and indeed provides a robust estimate of the mean, it significantly overestimates the ground truth variance in the presence of outliers, yielding an error . Our analysis allows us to recover the ground truth variance via (6) up to an error due to (8).
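As a small illustration of why the Student’s t variance can be a poor estimate, the sketch below maps normal-gamma parameters to the parameters of the marginal Student’s t-distribution and to its variance. It assumes the standard normal-gamma parametrization with location `m`, shape `alpha`, rate `beta`, and precision scaling `nu`; the function names are hypothetical.

```python
import math

def student_t_params(m, alpha, beta, nu):
    """Marginal of a Gaussian likelihood under a normal-gamma prior.

    Returns (degrees of freedom, location, squared scale).
    """
    return 2.0 * alpha, m, beta * (nu + 1.0) / (nu * alpha)

def student_t_variance(alpha, beta, nu):
    """Variance of that marginal; it is infinite when alpha <= 1."""
    if alpha <= 1.0:
        return math.inf
    return beta * (nu + 1.0) / ((alpha - 1.0) * nu)

dof, loc, scale2 = student_t_params(0.0, 2.0, 1.0, 1.0)
var = student_t_variance(2.0, 1.0, 1.0)
```

For a small fitted `alpha` the factor `1 / (alpha - 1)` blows up, which is consistent with the remark that the uncorrected variance may even be infinite.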
A practical algorithm for fitting GCP networks is given in the supplement (Appendix A).
1.2 Related work
There are several related approaches to mitigating the influence of output outliers. One popular approach is based on fitting heavy-tailed distributions, such as Student’s t [21, 24, 29]. Effectively, our GCP method also fits Student’s t-distribution, but additionally it reconstructs the ground truth variance via (6) or (8). Localization of a probabilistic model [34] generalizes heavy-tailed distributions. The localization principle allows the likelihood of each sample to depend on its own copy of a latent variable, while all the copies obey the same probability distribution. In particular cases, marginalizing the latent variables gives rise to Student’s t marginal likelihood. Another body of methods uses data reweighting. One can manually assign binary weights to samples [17] or use the Bayesian framework [35], in which the likelihood of each sample is raised to a power that is itself a latent variable. The posterior of these latent variables is inferred together with the posterior of other latent variables in the model. Another type of reweighting is provided by so-called robust divergences, which are used instead of the Kullback–Leibler divergence either in directly approximating the ground truth distribution or in learning the posterior distribution of the parameters. For example, the -entropy was used in [5], while $\beta$- and $\gamma$-divergences were studied in [1, 10, 7]. A number of papers develop robust gradient descent methods by detecting and reweighting the gradients of outliers during backpropagation [15, 28, 39] or by removing outliers from a fitted model followed by refitting [4]. We emphasize that our GCP approach, in contrast to the above methods, can be trained in one run with any standard optimizer (such as Adam, RMSprop, etc.), and it requires neither fine-tuning additional hyperparameters nor explicitly estimating the contamination proportion $\varepsilon$. On the other hand, knowing $\varepsilon$, one can further reduce the error of the variance estimate, see (8).
2 The GCP approach
We recall the GCP approach introduced in [11] and outlined in Sec. 1.1.
2.1 GCP update
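We describe an update of $m, \alpha, \beta, \nu$ in (2) directly, assuming $x$ to be fixed. We refer to Remark 2.1 and to [11] for details concerning an update of the weights of the neural networks representing $m, \alpha, \beta, \nu$. Suppose we observe a new sample $y$. Then, using Bayes’ theorem, we find the conditional distribution of $\boldsymbol{\mu}, \boldsymbol{\tau}$ under the condition that $\mathbf{y} = y$. This posterior distribution, denoted by $p_{\mathrm{post}}(\mu, \tau)$, is also normal-gamma [2], namely, $p_{\mathrm{post}}(\mu, \tau) = p(\mu, \tau \mid m', \alpha', \beta', \nu')$, where the parameters are updated as follows:
$$m' = \frac{\nu m + y}{\nu + 1}, \quad \nu' = \nu + 1, \quad \alpha' = \alpha + \frac{1}{2}, \quad \beta' = \beta + \frac{\nu}{\nu + 1}\,\frac{(y - m)^2}{2}. \tag{10}$$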
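The conjugate update for a single observation is short enough to state as code. The sketch below assumes the standard normal-gamma parametrization (location `m`, shape `alpha`, rate `beta`, precision scaling `nu`); the function name is hypothetical.

```python
def normal_gamma_posterior(m, alpha, beta, nu, y):
    """Normal-gamma posterior parameters after observing one sample y."""
    m_post = (nu * m + y) / (nu + 1.0)          # precision-weighted mean
    alpha_post = alpha + 0.5                    # half a degree of freedom per sample
    beta_post = beta + nu * (y - m) ** 2 / (2.0 * (nu + 1.0))
    nu_post = nu + 1.0                          # one more pseudo-observation
    return m_post, alpha_post, beta_post, nu_post

# In the input-independent Bayesian setting the update is simply iterated.
params = (0.0, 1.0, 1.0, 1.0)
for y in (2.0, 2.0):
    params = normal_gamma_posterior(*params, y)
```

In the GCP setting this iteration is not available, because each observation belongs to a different input; the posterior parameters are only used as a target for one gradient step.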
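However, in the framework of neural networks, one cannot update $m, \alpha, \beta, \nu$ directly. Instead, we fix $m', \alpha', \beta', \nu'$ according to (10) and use the KL divergence from the posterior to the prior, see [30]:
$$K = \frac{1}{2}\frac{\alpha'}{\beta'}\,\nu\,(m - m')^2 + \frac{1}{2}\frac{\nu}{\nu'} - \frac{1}{2}\ln\frac{\nu}{\nu'} - \frac{1}{2} + \alpha \ln\frac{\beta'}{\beta} + \ln\frac{\Gamma(\alpha)}{\Gamma(\alpha')} + (\alpha' - \alpha)\,\psi(\alpha') - (\beta' - \beta)\frac{\alpha'}{\beta'}, \tag{11}$$
where $\psi$ is the digamma function and $\Gamma$ is the gamma function. After that, we update $m, \alpha, \beta, \nu$ by performing a gradient descent step in the direction $-\nabla_{(m, \alpha, \beta, \nu)} K$. Recalling that the observations are sampled from the contaminated distribution in (1), we can approximate the fitting process by the dynamical system (5).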
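The update described above can be sketched for a single input with the parameters treated as free variables rather than network outputs. This is an illustration under stated assumptions, not the authors’ implementation: the KL divergence is the standard closed form for two normal-gamma distributions, the gradient is taken with the posterior parameters held fixed, and `digamma`, `gcp_step`, the learning rate, and the numerical values are hypothetical choices.

```python
import math

def digamma(x):
    """Digamma function via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series

def posterior(m, alpha, beta, nu, y):
    """Normal-gamma posterior parameters after observing one sample y."""
    return ((nu * m + y) / (nu + 1.0), alpha + 0.5,
            beta + nu * (y - m) ** 2 / (2.0 * (nu + 1.0)), nu + 1.0)

def kl_post_to_prior(prior, post):
    """KL divergence between two normal-gamma laws, posterior as first argument."""
    m, a, b, nu = prior
    mp, ap, bp, nup = post
    return (0.5 * (ap / bp) * nu * (m - mp) ** 2
            + 0.5 * nu / nup - 0.5 * math.log(nu / nup) - 0.5
            + a * math.log(bp / b) + math.lgamma(a) - math.lgamma(ap)
            + (ap - a) * digamma(ap) - (bp - b) * ap / bp)

def kl_gradient(prior, post):
    """Gradient of the KL divergence in (m, alpha, beta, nu), posterior held fixed."""
    m, a, b, nu = prior
    mp, ap, bp, nup = post
    return ((ap / bp) * nu * (m - mp),
            math.log(bp / b) + digamma(a) - digamma(ap),
            ap / bp - a / b,
            0.5 * (ap / bp) * (m - mp) ** 2 + 0.5 / nup - 0.5 / nu)

def gcp_step(prior, y, lr=0.01):
    """One plain gradient descent step for a single observation y."""
    post = posterior(*prior, y)
    grad = kl_gradient(prior, post)
    return tuple(p - lr * g for p, g in zip(prior, grad))

prior = (0.0, 2.0, 1.0, 1.0)      # (m, alpha, beta, nu), illustrative values
updated = gcp_step(prior, y=3.0)
```

In a network, the same gradient would be propagated to the weights, and positivity of `alpha`, `beta`, and `nu` would typically be enforced by the output parametrization; the plain step above does not enforce it.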
Remark 2.1.
If are parametrized by weights of neural networks, then the gradient of must be taken with respect to those weights, see the algorithm in the supplement (Appendix A). The dynamics of the weights will induce a dynamics of with the right-hand sides that contain the gradients of with respect to the weights [11]. However, they will enter as prefactors in (5). Hence any equilibrium of (5) will be an equilibrium of the dynamical system for the weights.
2.2 Explicit dynamical system
Dynamical system (5) can be explicitly written as follows (cf. (3.4)–(3.7) in [11]):
[TABLE]
where
[TABLE]
the integrals are taken over , , is defined in (1), and . Equations (14) imply
[TABLE]
The first goal of this paper is to show that, given outliers (), fitting the parameters by the GCP method automatically yields finite values of and . Theorem 3.1 shows that finite and occur via bifurcation at infinity as becomes nonzero. The second goal is to show that the obtained prognostic mean and variance in (6) do approximate the ground truth mean and variance in the sense of (8) given the output of a fitted GCP network. This is done in Theorems 4.1 and 4.2.
3 Bifurcation of predictive distribution from Gaussian to Student’s t
In this section, we show that an arbitrarily small percentage of outliers qualitatively changes the dynamics of (12)–(14), (18), namely, it makes and converge to finite values. This changes the predictive distribution from Gaussian to Student’s t. We will prove that this happens via bifurcation of the equilibrium at infinity (Fig. 1, right). In the next section, we show that the correction given by the prognostic variance in (6) is close to the ground truth variance, up to an error that is linear in the proportion of outliers.
Denote by the th central moment of . The following technical assumption requires that the mean or the variance of outliers be large enough, or have heavy tails. It is used only in this section and does not depend on .
Condition 3.1.
The outliers distribution satisfies , where
[TABLE]
We will see that plays a role of an indicator of outliers. The larger is compared with , the better the GCP method recognizes samples from as outliers and the better it filters them out. A similar role of an indicator will be played by the absolute value of the constant
[TABLE]
Theorem 3.1.
Let Condition 3.1 hold with some . Then, for all sufficiently small , there exists a unique equilibrium of system (12), (13), (18). The following asymptotics is true as :
[TABLE]
[TABLE]
Theorem 3.1 is proved in the supplement.
4 Prognostic mean and variance
The main practical question we answer in this section is the following. Given a finite equilibrium (as observed after the model is fitted), what can we tell about the ground truth mean and variance ? Due to (6), the equilibrium uniquely determines prognostic mean and variance . Thus, for each , there remain 4 unknowns in the 3 equations (see (28)–(30)). In this section, we assume they are functions of and obtain their asymptotics for small under the following condition.
Condition 4.1.
Either or is constant in .
The next theorem shows that the prognostic mean is exponentially close to .
Theorem 4.1.
Let , where is an arbitrary distribution with zero mean and unit variance. Let be an equilibrium (independent of ) for system (12), (13), (18). Let be bounded for all small . Then there is an equilibrium of Eq. (12) such that
[TABLE]
for some that does not depend on and .
The proof is given in the supplement (Appendix C). [Footnote 3: Theorem 4.1 is proved under the assumption that either or is bounded for small , which is weaker than Condition 4.1.]
Next, we analyze how much the prognostic variance in (6) differs from the ground truth variance . Theorem 4.1 shows that the equilibrium of (12) is exponentially close to . Therefore, to simplify our next statement and the technicalities of its proof, we assume that .
Theorem 4.2.
Let , where is an arbitrary distribution with zero mean and unit variance. Let be an equilibrium (independent of ) for system (13), (18) with . Then
[TABLE]
where the constant is defined in the supplement (Appendix D).
The proof is given in the supplement (Appendix D). Moreover, we prove therein that any finite is realizable as an equilibrium for some .
Asymptotics (24) should be compared with Student’s t variance in (9), which yields an error of order if and an infinite error if .
5 Experiments
5.1 Methods
We compare the following robust methods. [Footnote 4: Our preliminary results with robust gradient descent in [15] and [39] were significantly worse than those obtained by the other methods, especially in the case of input-dependent variance in the loss. Therefore, we do not include them in Table 1. We did not implement the robust gradient estimation in [28] because fine-tuning its hyperparameters requires the knowledge of the proportion of outliers, see Sec. 3.3 therein.]
1. Beta and Gamma: the methods in which one minimizes, respectively, the $\beta$- and $\gamma$-divergences from the ground truth to the approximating normal distribution [1, 10, 6];
2. BetaBayes: the robust Bayesian method based on the $\beta$-divergence [7]; [Footnote 5: We performed a grid search for $\beta$ and the (input-independent) standard deviation of the likelihood. By varying these two parameters, one obtains the same set of loss functions as by varying $\gamma$ and the standard deviation in the robust Bayesian method in [7] based on the $\gamma$-divergence. Therefore, we do not include the latter method as a separate one in our comparison list.]
3. GCPSt: the GCP with the Student’s t variance (9), https://github.com/hstuk/GCP;
4. GCP: the GCP with the prognostic variance (6), https://github.com/hstuk/GCP;
5. EnsBeta, EnsGamma, EnsGCP: ensembles of 5 Beta, Gamma, and GCP networks, respectively.
Note that the Beta, Gamma and the GCP-based methods estimate aleatoric uncertainty since they learn the variance of labels conditioned on the input , while the Bayesian method BetaBayes estimates epistemic uncertainty since the variance of the likelihood is treated as a hyperparameter, while the predictive variance is -dependent only due to randomness in the weights. The ensemble methods are supposed to learn both aleatoric and epistemic uncertainty, and their overall variance is computed as the variance of the Gaussian mixture distribution, cf. [20]. Architectures and hyperparameters for all methods are given in the supplement.
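The overall variance of an ensemble, computed as the variance of a Gaussian mixture, can be sketched as follows. Equal mixture weights are an assumption made here for illustration, and the function name is hypothetical.

```python
import numpy as np

def mixture_mean_variance(means, variances):
    """Mean and variance of an equally weighted mixture of Gaussians.

    means, variances: arrays of shape (n_members, ...) holding the
    per-member predicted means and variances.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = means.mean(axis=0)
    # E[y^2] of the mixture minus the squared mixture mean.
    var = (variances + means ** 2).mean(axis=0) - mean ** 2
    return mean, var

mean, var = mixture_mean_variance([0.0, 2.0], [1.0, 1.0])
```

The result splits into the average member variance plus the spread of the member means, which is how an ensemble can reflect both aleatoric and epistemic uncertainty.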
5.2 Synthetic data set
We generate a synthetic data set containing 5% of outliers. To do so, we choose the set consisting of 400 points uniformly distributed on the interval . For each , with probability 0.95 we sample from the normal distribution with mean and standard deviation , and with probability 0.05 we sample from a uniform distribution on the interval . Figure 2 shows the data and the fits for different methods. Even though the means are accurately predicted by most robust methods, the GCP learns the variance best. Furthermore, the output of the GCP network provides additional information, namely, small values of indicate that the corresponding samples belong to a (less trust-worthy) region in which the training set contained outliers.
5.3 Real world data sets
Data sets. We analyze the following publicly available data sets: Boston House Prices [13] (506 samples, 13 features), Concrete Compressive Strength [38] (1030 samples, 8 features), Combined Cycle Power Plant [33, 18] (9568 samples, 4 features), Yacht Hydrodynamics [9, 26] (308 samples, 6 features), and Kinematics of an 8 Link Robot Arm Kin8Nm (8192 samples, 8 features). [Footnote 6: http://mldata.org/repository/data/viewslug/regression-datasets-kin8nm/] Each data set is randomly split into train-test subsets with 95% of samples in the training subset. For each training set, we randomly choose a given percentage of samples (0%, 5%, 10%, 15%, or 20%) and replace them by outliers. The outliers are sampled from the Gaussian distribution with the mean equal to the mean over all the targets in the original training set and standard deviation equal to ten times the standard deviation over the targets in the original training set. All the results reported below are the respective averages over 50 cross-validations.
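The contamination protocol described above can be sketched as follows. The function name `inject_outliers` and the stand-in targets are assumptions for illustration; the sketch only reproduces the stated rule (Gaussian outliers centered at the target mean with ten times the target standard deviation).

```python
import numpy as np

def inject_outliers(y_train, percent, rng):
    """Replace `percent` % of the training targets by Gaussian outliers."""
    y_train = np.asarray(y_train, dtype=float)
    y = y_train.copy()
    k = int(round(len(y) * percent / 100.0))
    idx = rng.choice(len(y), size=k, replace=False)   # samples to corrupt
    # Outliers: same mean as the clean targets, ten times their standard deviation.
    y[idx] = rng.normal(y_train.mean(), 10.0 * y_train.std(), size=k)
    return y, idx

rng = np.random.default_rng(0)
clean = rng.normal(20.0, 5.0, size=1000)   # stand-in targets, not a real data set
noisy, idx = inject_outliers(clean, 10, rng)
```

Only the training targets are modified; the test subset stays clean, so the reported scores measure how well the ground truth distribution is recovered.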
Measures. We use two measures of the quality of the fit.
- The overall root mean squared error (RMSE).
- The area under the following curve (AUC), measuring the trade-off between properly learning the mean and the variance. Assume the test set contains $N$ samples. We order them with respect to their predicted variance. For each $n$, we remove the $n$ samples with the highest variance and calculate the RMSE for the remaining $N - n$ samples (with the lowest variance). We denote it by $\mathrm{RMSE}(n)$ and plot it versus $n$ as a continuous piecewise linear curve. The second measure is the normalized area under this curve.
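The second measure can be sketched as follows. The normalization constant is not recoverable from the text, so dividing by the length $N - 1$ of the horizontal axis is an assumption made here, as are the function name and the trapezoidal evaluation of the piecewise linear curve.

```python
import numpy as np

def rmse_rejection_auc(y_true, y_pred, var_pred):
    """Area under the curve n -> RMSE after removing the n highest-variance samples."""
    order = np.argsort(var_pred)                       # lowest variance first
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))[order] ** 2
    n_total = len(sq_err)
    cum = np.cumsum(sq_err)
    kept = np.arange(n_total, 0, -1)                   # n_total, ..., 1 samples kept
    rmse = np.sqrt(cum[kept - 1] / kept)               # rmse[n]: n samples removed
    area = np.sum((rmse[:-1] + rmse[1:]) / 2.0)        # trapezoids with unit spacing
    return area / (n_total - 1), rmse                  # assumed normalization

auc, curve = rmse_rejection_auc([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

When large errors coincide with large predicted variances, the curve decreases quickly and the area is small, so a low value rewards a method that learns both the mean and the variance.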
Results. Table 1 presents RMSE and AUC scores for the outliers’ percentages 0%, 5%, 10%, 15%, and 20%, respectively. [Footnote 7: A dedicated symbol in Table 1 indicates that we were not able to fine-tune the parameters of BetaBayes to obtain reasonable predictions for the Power and Kin8nm data sets. Note that the authors in [7] used a protocol for fitting BetaBayes different from ours. Unlike us, they first normalized the noncontaminated training set and then added outliers.] In each column, we mark a method in bold if it is significantly (according to the two-tailed paired difference test) better than or indistinguishable from all the other methods. We see that the GCP significantly improves AUC scores of the GCPSt in the presence of outliers. Furthermore, EnsGCP yields the best AUC among all the methods for all , see also Fig. 3. Thus, it provides the best trade-off between properly learning the mean and the variance. Its RMSE score is competitive with or superior to the other methods. Moreover, after removing a small number of samples for which EnsGCP predicts a high variance, its RMSE for the remaining samples becomes significantly better than the respective RMSE of the other methods, see the curves in Fig. 5 in the supplement.
6 Conclusion
We analyzed the minima of the energy surfaces of the GCP networks encoding the priors of latent variable models. Under the assumption of Huber’s $\varepsilon$-contamination of the Gaussian ground truth distribution, we obtained formulas for the prognostic mean and variance in terms of the outputs of the GCP networks, yielding errors for the ground truth mean and variance that are, respectively, exponentially small and linear in $\varepsilon$.
The GCP networks can be trained with standard optimizers (such as Adam, RMSProp, etc.) and do not require fine-tuning additional hyperparameters. Experiments with synthetic and real-world data with outliers showed their superiority over several other state-of-the-art robust methods based on neural networks.
Appendix A Algorithm for fitting a GCP network and predicting the mean and variance of the ground truth distribution
In this section, we present a practical algorithm for defining a loss of a GCP network, fitting it, and predicting the mean and variance of the ground truth distribution in a robust way. The code is available at https://github.com/hstuk/GCP.
Given an input and a vector of weights , we denote the -dimensional output of the GCP network by . The outputs can share the weights or have independent weights, in which case . For each labeled sample with , , we define a loss according to Algorithm 1 or Algorithm 2. According to [11, Lemma 2.1], these two algorithms yield the same loss up to an additive constant not depending on .
Given the loss defined in Algorithm 1 or 2 and a training set , we fit the GCP network by minimizing
[TABLE]
using any standard optimizer (e.g., Adam, RMSProp, etc.). Once the GCP network is fitted, we predict the mean and variance of the ground truth distribution as follows (see Eq. (6)):
[TABLE]
where is defined as a unique root of Eq. (7). The function can be precalculated in advance or, due to [11], approximated by
[TABLE]
see Fig. 4.
Remark A.1.
The fitted GCP network minimizes the negative log-likelihood of the Student’s t-distribution (3), see Algorithm 2. One can rewrite the above prognostic variance in terms of the parameters of this Student’s t-distribution, namely
[TABLE]
This approach would reduce the 4-dimensional output of the GCP network to a 3-dimensional output directly encoding the parameters of the Student’s t-distribution. However, the resulting dynamics of the weights and the induced dynamics of (a counterpart of dynamical system (12)–(14)) remain an open question and a direction for future research.
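The per-sample loss mentioned in Remark A.1 can be sketched as the negative log-density of the Student’s t marginal. This is an illustration in plain Python for scalar inputs, not the repository code: the function name is hypothetical, and in practice the same expression would be written in an automatic differentiation framework and averaged over the training set.

```python
import math

def student_t_nll(y, m, alpha, beta, nu):
    """Negative log-density at y of Student's t with 2*alpha degrees of freedom,
    location m and squared scale beta*(nu+1)/(nu*alpha)."""
    scale2 = beta * (nu + 1.0) / (nu * alpha)
    z2 = (y - m) ** 2 / scale2
    return (math.lgamma(alpha) - math.lgamma(alpha + 0.5)
            + 0.5 * math.log(2.0 * math.pi * alpha * scale2)
            + (alpha + 0.5) * math.log1p(z2 / (2.0 * alpha)))

# Loss of one labeled sample for illustrative parameter values.
loss = student_t_nll(3.0, 0.0, 2.0, 1.0, 1.0)
```

The logarithmic growth in `(y - m) ** 2` is what limits the influence of a single outlier on the fitted mean, in contrast to the quadratic growth of a Gaussian loss.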
Appendix B Proof of Theorem 3.1
We assume throughout the proof that the density of the outliers distribution is continuously differentiable, its sixth central moment exists, and there is a constant such that
[TABLE]
and
[TABLE]
for all .
1. Without loss of generality, assume that
[TABLE]
First, we show that system (12)–(14) has at least one equilibrium . To do so, it suffices to prove that the system of equations
[TABLE]
(where the integrals are taken over , , is defined in (1), and ) has a root .
First, we solve Eq. (30) with respect to . Setting , we have for :
[TABLE]
Hence, additionally using the decay of at infinity to estimate the integral for , we obtain
[TABLE]
Here
[TABLE]
the functions for (and below) are smooth for and in a neighborhood of the origin, and their partial derivatives with respect to and are as uniformly for and in a neighborhood of the origin.
Solving for yields
[TABLE]
where, for brevity, we omitted the dependence of the functions on their arguments.
Using the Taylor formula for the logarithm and the asymptotic expansion of , we have
[TABLE]
Plugging in given by (31) into (32) and dividing by , we see that, for , system (28)–(30) is equivalent to
[TABLE]
We solve system (33)–(35) with respect to , using the implicit function theorem. Note that due to (27) and since and are the second and the fourth central moments of the Gaussian distribution . At , we have
[TABLE]
The vector of -derivatives at is
[TABLE]
Hence, by the implicit function theorem, there exist such that for any , system (33), (35) has a unique root in the set
[TABLE]
Moreover, are smooth functions of and
[TABLE]
In particular, (37) shows that and hence . Combining (37) with (31) proves asymptotics (22). To prove asymptotics (21), we substitute and into (33). This yields
[TABLE]
where is the third moment about [math] for the outliers distribution . Rewriting via the central moments, we see that the constant equals the coefficient at in (21).
2. It remains to show that system (12)–(14) has no other equilibrium except for that found in part 1 of the proof. Assume, to the contrary, that there is a sequence and the respective sequence of solutions of system (28)–(30) that is different for each from those in part 1 of the proof.
First, we show that there exists (independent of and ) such that . Assume this is not true. First consider the case where is bounded. Let (the case is treated similarly). We rewrite Eq. (28) as follows:
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
where do not depend on . Further, we choose such that
[TABLE]
for all . Then, using the assumption that is bounded, we have for all sufficiently large
[TABLE]
where does not depend on . Relations (39)–(41) contradict (38).
Consider the case . Then , and we rewrite Eq. (28) as follows:
[TABLE]
Then
[TABLE]
where as uniformly with respect to . This again contradicts the assumption . Thus, any root of Eq. (28) indeed satisfies .
3. Further, we show that is bounded away from [math]. Assume, to the contrary, that (possibly after passing to a subsequence) . Then, due to (30), . Expressing via in (30) and using the fact that is bounded, we immediately see that for all sufficiently large , where does not depend on . On the other hand, (29) is equivalent to
[TABLE]
Since is bounded, the latter equality yields for all sufficiently large , where does not depend on . This contradicts the first inequality for .
4. Due to part 2, we can assume (possibly after passing to a subsequence) that for some . If is bounded, then (possibly after passing to a subsequence) and due to part 3. Then by Theorem 3.1 in [11], . Furthermore, since , it follows from (30) that . Thus, solve the equations (28), (30) with . However, by Theorem 3.2, item (c) in [11], the system of these two equations has no solution for . Therefore, , and for sufficiently large , they enter a region where, by part 1, the solution is unique.
Appendix C Proof of Theorem 4.1
For the proof of Theorem 4.1, we need two auxiliary results, which are given in the next two subsections.
C.1 Prognostic mean for any fixed
In this subsection, we assume that is fixed and is not necessarily small, and analyze how the equilibrium of Eq. (12) gets perturbed compared with the ground truth mean , provided that or is large. We will see that the larger the values of or are, the better the samples from are recognized as outliers and the stronger gets shifted towards .
Lemma C.1.
Let , where is an arbitrary distribution with zero mean and unit variance. We fix and . Then the following hold for all .
1. If is large enough, then Eq. (12) has an equilibrium in a neighborhood of satisfying
[TABLE]
as where
[TABLE]
2. If is large enough, then Eq. (12) has an equilibrium in a neighborhood of satisfying, for any ,
[TABLE]
In both cases, is uniform with respect to , , and from bounded intervals.
Proof.
Without loss of generality, assume that .
Proof of item 1.
We set and apply the implicit function theorem to
[TABLE]
We have . Integrating by parts yields
[TABLE]
where is defined in (43). Further, . Hence, there is a neighborhood of in which Eq. (45) has a unique root for each fixed , and
[TABLE]
Finally, one can check that the second derivatives of are continuous in a neighborhood of , which implies the Taylor expansion of equivalent to (42). ∎
Proof of item 2.
We fix an arbitrary and set , so that . We will apply the implicit function theorem to
[TABLE]
We have , , . Hence, there is a neighborhood of in which Eq. (46) has a unique root for each fixed , and
[TABLE]
Furthermore, one can check that the second partial derivatives of are continuous in a neighborhood of . Therefore, as . Since is arbitrary, the latter asymptotics is equivalent to (44). ∎
C.2 An auxiliary algebraic relation
For the reader’s convenience, we formulate the following lemma, which is proved in [11, Lemma 3.1].
Lemma C.2.
For each , the equation
[TABLE]
with respect to has a unique root . The function is monotone increasing from [math] to and satisfies for all , see Fig. 4.
It implies the following corollary.
Corollary C.1.
For each , the equation
[TABLE]
with respect to has a unique root , where is defined in Lemma C.2.
C.3 Proof of Theorem 4.1
- We set and .
Then that satisfies
[TABLE]
where is uniformly bounded with respect to all its arguments. Further, is bounded by assumption, and Eq. (49) with the zero right hand side has a unique solution . Therefore, there exists as uniformly with respect to , , and .
- Due to (13), (18), the equilibrium satisfies
[TABLE]
[TABLE]
Note that the functions and coincide with those in (29) and (30) (the latter up to a sign), but here we explicitly indicate their dependence on , , , and .
Using Corollary C.1 and the fact that , we can pass to the limit in (51) as , and we see that . Hence, passing to the limit in (50), we have
[TABLE]
where
[TABLE]
Since or are bounded by assumption, we obtain or , respectively. Combining this with Lemma C.1 concludes the proof.
Appendix D Proof of Theorem 4.2
In the formulation of Theorem 4.2, we use the constant
[TABLE]
In the proof, we will also need the constant
[TABLE]
Note that after substituting given by (6), the variable cancels. Thus and are indeed functions of only, with .
We will prove the theorem under the assumption that does not depend on . The case where does not depend on is analogous.
Without loss of generality, assume that . Due to (13), (18), the equilibrium satisfies
[TABLE]
[TABLE]
Note that the functions and are the same as in (50) and (51), but we omit the dependence on , which is assumed to coincide with .
We will show that one can find unique roots and of Eq. (55) and (56) as functions of (and the other parameters) and determine their asymptotics, provided is small. First, assume that and exist for all sufficiently small . Then is bounded as . Otherwise, passing to the limit in (56), we would obtain . Furthermore, it is bounded away from zero. Otherwise, passing to the limit in (56), we would obtain . Thus, in what follows, it suffices to consider from a bounded interval separated from zero.
We introduce the variable instead of such that and prove existence of . Here is given by (52) and by (54).
First, we solve Eq. (56) for . Consider the function . Note that there is independent of such that for all and ,
[TABLE]
and is monotone with respect to . Hence, Eq. (56) has a unique root for all and , and, due to Corollary C.1, , where is uniform with respect to all . The partial derivatives of with respect to all its arguments are continuous for all , and . Furthermore, as , we have
[TABLE]
Hence, by the implicit function theorem, is continuously differentiable with respect to for all and . In particular,
[TABLE]
where is defined in Eq. (53).
We substitute and into (55), and obtain the equation
[TABLE]
where
[TABLE]
Note that
[TABLE]
where
[TABLE]
and is defined in (27). Therefore,
[TABLE]
where and are defined in (52) and (54) and are bounded and continuous for and all .
Further,
[TABLE]
where
[TABLE]
Combining (58)–(61), we see that, for , Eq. (58) is equivalent to
[TABLE]
We have , the partial derivatives of are continuous for all and , and . Hence, by the implicit function theorem, there exist small and such that Eq. (62) has a unique solution for all , . This solution is continuously differentiable in a neighborhood of the origin. Similarly, there is a unique solution for all , . To prove that there are no solutions outside of these two -regions, one can show that is monotone decreasing for , monotone increasing for , and . This proves (29). Applying the chain rule to also yields (28).
Appendix E The RMSE() curves
Figure 5 shows the RMSE() curves for the different methods and data sets from Sec. 6.3, fitted on training sets contaminated by 5% of outliers. We see that (possibly after removing a small number of samples for which EnsGCP predicts a high variance) the RMSE of EnsGCP is significantly better than the respective RMSE of the other methods; see the curves in Fig. 5.
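A curve of this kind can be computed along the following lines. This is a minimal sketch, assuming that the argument of the curve counts the test samples removed in order of decreasing predicted variance; the precise definition is the one given in Sec. 6.3. The function name `rmse_curve` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def rmse_curve(y_true, y_pred, var_pred):
    """RMSE over the samples kept after dropping the k most uncertain ones.

    Returns an array r where r[k] is the RMSE after removing the k samples
    with the largest predicted variance, for k = 0, ..., n - 1.
    (Assumed definition; see the lead-in.)
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    var_pred = np.asarray(var_pred, dtype=float)
    # Sort samples from the most to the least uncertain prediction.
    order = np.argsort(-var_pred, kind="stable")
    sq_err = (y_true[order] - y_pred[order]) ** 2
    # Suffix sums: total squared error of the samples kept after removing k.
    kept_sum = np.cumsum(sq_err[::-1])[::-1]
    kept_count = np.arange(len(sq_err), 0, -1)
    return np.sqrt(kept_sum / kept_count)

# Toy usage: the single large error coincides with the largest predicted variance.
curve = rmse_curve([0, 0, 0, 0], [2, 0, 0, 0], [10, 1, 1, 1])
```

If a method assigns a high variance to exactly those samples on which it errs most, the curve drops quickly as the first few samples are removed, which is the behavior described above for EnsGCP.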
Appendix F Architectures and hyperparameters
F.1 Architectures
We use one-hidden-layer networks with 50 ReLU nonlinearities for the Beta, Gamma, and GCP. Whenever a method uses several quantities (e.g., the mean and variance in the Beta and Gamma, or in the GCP), we approximate each quantity by a separate network. For regularization in non-Bayesian methods, we use a dropout layer between the hidden layer and the output unit. Our approach is directly applicable to neural networks of any depth and structure; however, we keep one hidden layer for the compatibility of our validation with [14, 20, 12]. For BetaBayes, we used the architecture from the authors’ code (https://github.com/futoshi-futami/Robust_VI).
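The architecture described above can be sketched as follows. This is an illustrative NumPy forward pass only, not the authors' implementation: the initialization scheme, the input dimension of 13, the quantity names `mean` and `variance`, and the helper names `init_net` and `forward` are all assumptions. Any transformation that a method applies to the raw outputs (for example, to keep a variance positive) is not shown.

```python
import numpy as np

HIDDEN_UNITS = 50  # from the text: one hidden layer with 50 ReLU units

def init_net(n_features, rng):
    """Parameters of a one-hidden-layer network with a single output unit."""
    return {
        # He-style initialization is an assumption, not taken from the paper.
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_features), (n_features, HIDDEN_UNITS)),
        "b1": np.zeros(HIDDEN_UNITS),
        "W2": rng.normal(0.0, np.sqrt(1.0 / HIDDEN_UNITS), (HIDDEN_UNITS, 1)),
        "b2": np.zeros(1),
    }

def forward(params, x, dropout_rate=0.0, rng=None):
    """Forward pass; dropout acts between the hidden layer and the output unit."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    if dropout_rate > 0.0 and rng is not None:
        # Inverted dropout (training mode): rescale to keep the expected activation.
        keep = rng.random(h.shape) >= dropout_rate
        h = h * keep / (1.0 - dropout_rate)
    return (h @ params["W2"] + params["b2"])[:, 0]

rng = np.random.default_rng(0)
# One separate network per predicted quantity (hypothetical quantity names).
nets = {name: init_net(13, rng) for name in ("mean", "variance")}
x = rng.normal(size=(5, 13))  # 5 samples with 13 input features (illustrative)
raw = {name: forward(p, x) for name, p in nets.items()}
```

Keeping one network per quantity means that no hidden units are shared between, say, the mean and the variance, so the dropout masks of the two networks are independent as well.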
F.2 Hyperparameters
When we fit all the methods except the BetaBayes, we first contaminate the training set by outliers and then normalize it such that the input features and the targets have zero mean and unit variance. For the BetaBayes, significantly better results were achieved without normalizing the targets. (Note that the authors in [7] used a different protocol: they normalized the noncontaminated training set and then added outliers to it.)
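The order of the two preprocessing steps matters, because the outliers enter the statistics used for normalization. The sketch below illustrates the contaminate-then-normalize protocol on synthetic targets. The outlier model inside `contaminate` is purely hypothetical (the actual one is described in Sec. 6.3), and scaling the test targets by the training-set statistics is an assumption as well.

```python
import numpy as np

def contaminate(y, fraction, rng, scale=10.0):
    """Replace a fraction of the targets by outliers (hypothetical outlier model)."""
    y = np.array(y, dtype=float)  # copy, so the caller's array is untouched
    n_out = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_out, replace=False)
    # Assumption: outliers are wide Gaussian draws around the target mean.
    y[idx] = y.mean() + scale * y.std() * rng.normal(size=n_out)
    return y, idx

def standardize(train, *others):
    """Scale all arrays by the statistics of `train` (zero mean, unit variance)."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return [(a - mu) / sigma for a in (train, *others)]

rng = np.random.default_rng(0)
y_train = rng.normal(5.0, 2.0, size=200)  # synthetic clean targets
y_test = rng.normal(5.0, 2.0, size=50)    # test targets stay noncontaminated

# Protocol described above: contaminate first, then normalize.
y_cont, outlier_idx = contaminate(y_train, 0.05, rng)
y_cont_std, y_test_std = standardize(y_cont, y_test)
```

Under this protocol the contaminated training targets have unit variance by construction, so the clean part of the data typically ends up with a smaller spread than it would have under the protocol of [7], where normalization precedes contamination.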
For the Beta, Gamma, and GCP, we used minibatch size 5 on Boston, Concrete, and Yacht, and minibatch size 10 on Power and Kin8nm. We used the Adam optimizer (with , ) for fitting, and performed a grid search for the learning rate in the range and for the dropout rate in the range . For the Beta and Gamma, we optimized for the learning rate with the fixed parameters and , respectively. After that, we additionally performed a grid search for and in the range . We observed that changing the learning rate for the newly found values of and did not significantly improve the results. All the grid searches were performed for training data sets with 5% of outliers and evaluated on the noncontaminated test sets. The optimized parameters are given in Table 2. For the ensemble methods, we used the hyperparameters that were optimal for the respective non-ensemble methods, but with half the dropout rate. For BetaBayes, we used the architecture, the default settings and the optimizer based on the Edward library [32] as in the authors’ code (https://github.com/futoshi-futami/Robust_VI), and we performed a grid search for the parameter and the standard deviation of the likelihood
References
- [1] Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
- [2] Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
- [3] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 7–9, July 2015, Lille, France, JMLR, 2016. W&CP 37, 1613.
- [4] Diakonikolas, I., Kamath, G., and Kane, D. M. Sever: A robust meta-algorithm for stochastic optimization. arXiv:1803.02815 [cs.LG], 2018.
- [5] Ferrari, D. and Yang, Y. Maximum lq-likelihood estimation. Annals of Statistics, 38(2):753–783, 2010.
- [6] Fujisawa, H. and Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
- [7] Futami, F., Sato, I., and Sugiyama, M. Variational inference based on robust divergences. 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), pp. 4–9, 2017.
- [8] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, New York, USA, JMLR, 2016. W&CP 48, 1050.
