Robustness Against Outliers For Deep Neural Networks By Gradient Conjugate Priors

Pavel Gurevich; Hannes Stuke

arXiv:1905.08464·stat.ML·May 22, 2019

TL;DR

This paper introduces a gradient conjugate prior (GCP) network that robustly estimates probability distributions in the presence of outliers, providing explicit bias correction formulas and demonstrating superior performance over existing methods.

Contribution

The paper develops a novel GCP network with explicit bias correction formulas for outlier-affected data, improving robustness in distribution reconstruction.

Findings

1. The GCP network effectively corrects the bias caused by outliers.
2. The fitted mean lies in an exponentially small neighborhood (of order $e^{-c/\varepsilon}$) of the ground truth mean.
3. The corrected variance remains close to the true variance, even with a high proportion of outliers.

Tables

Table 1: RMSE and AUC scores for 0%, 5%, 10%, 15%, and 20% of outliers.

	Boston
Outliers: 0%	RMSE	AUC
Beta	3.59 $\pm$ 1.51	2.14 $\pm$ 0.49
Gamma	3.64 $\pm$ 1.52	2.21 $\pm$ 0.55
Beta_Bayes	3.69 $\pm$ 1.52	2.53 $\pm$ 0.79
GCP_St	3.62 $\pm$ 1.60	1.92 $\pm$ 0.42
GCP	3.62 $\pm$ 1.60	1.91 $\pm$ 0.41
EnsBeta	3.71 $\pm$ 1.60	2.18 $\pm$ 0.58
EnsGamma	3.75 $\pm$ 1.65	2.35 $\pm$ 0.67
EnsGCP	3.67 $\pm$ 1.61	1.73 $\pm$ 0.42
Outliers: 5%	RMSE	AUC
Beta	3.42 $\pm$ 1.37	2.19 $\pm$ 0.51
Gamma	3.54 $\pm$ 1.46	2.22 $\pm$ 0.51
Beta_Bayes	3.76 $\pm$ 1.56	2.58 $\pm$ 0.81
GCP_St	3.57 $\pm$ 1.47	2.55 $\pm$ 1.11
GCP	3.57 $\pm$ 1.47	2.05 $\pm$ 0.48
EnsBeta	3.53 $\pm$ 1.48	2.19 $\pm$ 0.51
EnsGamma	3.59 $\pm$ 1.54	2.24 $\pm$ 0.57
EnsGCP	3.61 $\pm$ 1.52	1.85 $\pm$ 0.47
Outliers: 10%	RMSE	AUC
Beta	3.31 $\pm$ 1.26	2.21 $\pm$ 0.49
Gamma	3.49 $\pm$ 1.42	2.28 $\pm$ 0.52
Beta_Bayes	3.79 $\pm$ 1.63	2.73 $\pm$ 1.06
GCP_St	3.63 $\pm$ 1.52	2.54 $\pm$ 1.08
GCP	3.63 $\pm$ 1.52	2.10 $\pm$ 0.52
EnsBeta	3.49 $\pm$ 1.46	2.18 $\pm$ 0.53
EnsGamma	3.55 $\pm$ 1.52	2.22 $\pm$ 0.52
EnsGCP	3.66 $\pm$ 1.52	1.90 $\pm$ 0.52
Outliers: 15%	RMSE	AUC
Beta	3.32 $\pm$ 1.24	2.33 $\pm$ 0.51
Gamma	3.42 $\pm$ 1.31	2.21 $\pm$ 0.53
Beta_Bayes	3.84 $\pm$ 1.62	2.64 $\pm$ 0.94
GCP_St	3.57 $\pm$ 1.42	2.33 $\pm$ 0.99
GCP	3.57 $\pm$ 1.42	2.18 $\pm$ 0.67
EnsBeta	3.45 $\pm$ 1.42	2.18 $\pm$ 0.47
EnsGamma	3.51 $\pm$ 1.44	2.19 $\pm$ 0.47
EnsGCP	3.70 $\pm$ 1.51	2.03 $\pm$ 0.64
Outliers: 20%	RMSE	AUC
Beta	3.49 $\pm$ 1.32	2.69 $\pm$ 0.72
Gamma	3.45 $\pm$ 1.36	2.33 $\pm$ 0.58
Beta_Bayes	3.84 $\pm$ 1.58	2.60 $\pm$ 1.06
GCP_St	3.68 $\pm$ 1.52	2.49 $\pm$ 1.02
GCP	3.68 $\pm$ 1.52	2.37 $\pm$ 0.85
EnsBeta	3.43 $\pm$ 1.38	2.29 $\pm$ 0.52
EnsGamma	3.51 $\pm$ 1.44	2.23 $\pm$ 0.50
EnsGCP	3.69 $\pm$ 1.44	2.06 $\pm$ 0.54

Table 2: Learning rate (LR), dropout rate (D), and number of epochs (NE) for the Beta, Gamma, and GCP-based methods.

	Boston
	LR	D	NE
Beta, Gamma	0.00002	0.4	2500
GCP	0.0001	0.3	700

Table 3: Optimal values of $\beta$ and $\gamma$ for the Beta and Gamma methods.

	Boston
$β$	0.2
$γ$	0.4

Table 4: Optimal values of $\beta$ and the standard deviation $\sigma$ of the likelihood for the Beta Bayes method. The symbol $*$ indicates that we were not able to fine tune the parameters of Beta Bayes to obtain reasonable predictions for the Power and Kin8nm data sets. Note that the authors in [7] used a protocol for fitting Beta Bayes different from ours. Unlike us, they first normalized the noncontaminated training set and then added outliers to it.

	Boston
$β$	0.1
$σ$	1

Equations

$$p_{\rm c}(y|x)=(1-\varepsilon)\,p_{\rm g}(y|x)+\varepsilon\,p_{\rm o}(y|x),$$

$$p(y,\mu,\tau)=p(y|\mu,\tau)\,p(\mu,\tau|m,\nu,\alpha,\beta),$$

$$t_{2\alpha}(y|m,\sigma/\alpha)=\int p(y,\mu,\tau)\,d\mu\,d\tau,$$

$$\sigma:=\frac{\beta(\nu+1)}{\nu}.$$

$$\dot{m}=-{\mathbb E}\left[\frac{\partial K}{\partial m}\right],\quad \dot{\alpha}=-{\mathbb E}\left[\frac{\partial K}{\partial \alpha}\right],\quad \dot{\beta}=-{\mathbb E}\left[\frac{\partial K}{\partial \beta}\right],\quad \dot{\nu}=-{\mathbb E}\left[\frac{\partial K}{\partial \nu}\right],$$

$$m_{\rm p}:=m,\quad V_{\rm p}:=\frac{\sigma}{\alpha-A(\alpha)}.$$

$$\frac{2\alpha+1}{\sqrt{2\pi}}\int\frac{y^{2}}{2(\alpha-A)+y^{2}}\,e^{-y^{2}/2}\,dy-1=0.$$

$$m_{\rm g}=m_{\rm p}+O\bigl(e^{-c/\varepsilon}\bigr),\quad V_{\rm g}=(1-b\varepsilon)V_{\rm p}+O(\varepsilon^{2}),$$

$$V_{\rm St}:=\frac{\sigma}{\alpha-1}\ (\alpha>1),\quad V_{\rm St}:=\infty\ (\alpha\leq 1).$$

$$m^{\prime}=\frac{\nu m+y}{\nu+1},\quad \nu^{\prime}=\nu+1,\quad \alpha^{\prime}=\alpha+\frac{1}{2},\quad \beta^{\prime}=\beta+\frac{\nu}{\nu+1}\frac{(y-m)^{2}}{2}.$$

$$\begin{aligned}K(m,\nu,\alpha,\beta):={}&\frac{\alpha^{\prime}(m-m^{\prime})^{2}\nu}{2\beta^{\prime}}+\frac{\nu}{2\nu^{\prime}}-\frac{1}{2}\ln\frac{\nu}{\nu^{\prime}}-\frac{1}{2}\\&-\alpha\ln\frac{\beta}{\beta^{\prime}}+\ln\frac{\Gamma(\alpha)}{\Gamma(\alpha^{\prime})}-(\alpha-\alpha^{\prime})\Psi(\alpha^{\prime})+\frac{\alpha^{\prime}(\beta-\beta^{\prime})}{\beta^{\prime}},\end{aligned}$$

$$\dot{m}=(2\alpha+1)F(m,\sigma,\varepsilon),$$

$$\dot{\alpha}=-G(m,\alpha,\sigma,\varepsilon),$$

$$\dot{\beta}=\frac{1}{\beta}H(m,\alpha,\sigma,\varepsilon),\quad \dot{\nu}=-\frac{1}{\nu(\nu+1)}H(m,\alpha,\sigma,\varepsilon),$$

$$F(m,\sigma,\varepsilon):=\int\frac{z}{2\sigma+z^{2}}\,p_{\rm c}(y)\,dy,$$

$$G(m,\alpha,\sigma,\varepsilon):=\int\ln\left(1+\frac{z^{2}}{2\sigma}\right)p_{\rm c}(y)\,dy+\Delta\Psi(\alpha),$$

$$H(m,\alpha,\sigma,\varepsilon):=\int\frac{\alpha z^{2}-\sigma}{2\sigma+z^{2}}\,p_{\rm c}(y)\,dy,$$

$$\dot{\sigma}=\left(\frac{(\nu+1)^{2}}{\nu^{2}\sigma}+\frac{\sigma}{(\nu+1)^{2}\nu^{2}}\right)H(m,\alpha,\sigma,\varepsilon).$$

$$C_{\rm go}:=(m_{\rm o}-m_{\rm g})^{4}+6(V_{\rm o}-V_{\rm g})(m_{\rm o}-m_{\rm g})^{2}+3(V_{\rm o}-V_{\rm g})^{2}+(\mu_{\rm o}^{(4)}-3V_{\rm o}^{2})+4\mu_{\rm o}^{(3)}(m_{\rm o}-m_{\rm g}).$$

$$D_{\rm go}:=(m_{\rm o}-m_{\rm g})^{3}+3(V_{\rm o}-V_{\rm g})(m_{\rm o}-m_{\rm g})+\mu_{\rm o}^{(3)}.$$

$$m_{\varepsilon}=m_{\rm g}+(m_{\rm o}-m_{\rm g})\varepsilon-\frac{C_{\rm go}D_{\rm go}}{6V_{\rm g}^{3}}\varepsilon^{2}+O(\varepsilon^{3}),$$

$$\alpha_{\varepsilon}=\frac{3V_{\rm g}^{2}}{C_{\rm go}}\varepsilon^{-1}+O(\varepsilon^{-2}),\quad \sigma_{\varepsilon}=\frac{3V_{\rm g}^{3}}{C_{\rm go}}\varepsilon^{-1}+O(\varepsilon^{-2}).$$

$$m_{\rm g}=m_{\rm p}+O\big(e^{-c/\varepsilon}\big)\quad\text{as }\varepsilon\to 0$$

$$V_{\rm g}=(1-b\varepsilon)V_{\rm p}+O(\varepsilon^{2}),$$

$$(m_{\rm fix},\alpha_{\rm fix},\beta_{\rm fix},\nu_{\rm fix})\leftarrow\bigl(m(w_{\rm fix},x),\alpha(w_{\rm fix},x),\beta(w_{\rm fix},x),\nu(w_{\rm fix},x)\bigr)$$

$$(m^{\prime},\alpha^{\prime},\beta^{\prime},\nu^{\prime})\leftarrow\left(\frac{\nu_{\rm fix}m_{\rm fix}+y}{\nu_{\rm fix}+1},\ \alpha_{\rm fix}+\frac{1}{2},\ \beta_{\rm fix}+\frac{\nu_{\rm fix}}{\nu_{\rm fix}+1}\frac{(y-m_{\rm fix})^{2}}{2},\ \nu_{\rm fix}+1\right).$$

$$\begin{aligned}K(w,x,y)\leftarrow{}&\frac{\alpha^{\prime}(m(w,x)-m^{\prime})^{2}\nu(w,x)}{2\beta^{\prime}}+\frac{\nu(w,x)}{2\nu^{\prime}}-\frac{1}{2}\ln\frac{\nu(w,x)}{\nu^{\prime}}-\frac{1}{2}\\&-\alpha(w,x)\ln\frac{\beta(w,x)}{\beta^{\prime}}+\ln\frac{\Gamma(\alpha(w,x))}{\Gamma(\alpha^{\prime})}-(\alpha(w,x)-\alpha^{\prime})\Psi(\alpha^{\prime})+\frac{\alpha^{\prime}(\beta(w,x)-\beta^{\prime})}{\beta^{\prime}},\end{aligned}$$

$$(\tilde{\nu}(w,x),\tilde{\sigma}(w,x))\leftarrow\left(2\alpha(w,x),\ \sqrt{\frac{\beta(w,x)(\nu(w,x)+1)}{\nu(w,x)\alpha(w,x)}}\right)$$

$$t(w,x,y)\leftarrow\frac{\Gamma\left(\frac{\tilde{\nu}(w,x)+1}{2}\right)}{\Gamma\left(\frac{\tilde{\nu}(w,x)}{2}\right)\sqrt{\pi\tilde{\nu}(w,x)}\,\tilde{\sigma}(w,x)}\left(1+\frac{1}{\tilde{\nu}(w,x)}\left(\frac{y-m(w,x)}{\tilde{\sigma}(w,x)}\right)^{2}\right)^{-\frac{\tilde{\nu}(w,x)+1}{2}}$$

$$\sum_{(x,y)\in(X,Y)}L(w,x,y),$$

$$m_{\rm p}(x):=m(w,x),\quad V_{\rm p}(x):=\frac{\beta(w,x)(\nu(w,x)+1)}{\nu(w,x)\bigl(\alpha(w,x)-A(\alpha(w,x))\bigr)},$$

$$A(\alpha)\approx\frac{2\alpha}{2\alpha+3},$$

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Algorithms · Gaussian Processes and Bayesian Inference

Full text

Robustness Against Outliers For Deep Neural Networks By Gradient Conjugate Priors

Pavel Gurevich
Institute of Mathematics, Free University Berlin, 14195 Berlin, Germany

Hannes Stuke
Institute of Mathematics, Free University Berlin, 14195 Berlin, Germany

Author footnotes: Peoples’ Friendship University of Russia; Equal contribution

Abstract

We analyze a new robust method for the reconstruction of probability distributions of observed data in the presence of output outliers. It is based on a so-called gradient conjugate prior (GCP) network which outputs the parameters of a prior. By rigorously studying the dynamics of the GCP learning process, we derive an explicit formula for correcting the obtained variance of the marginal distribution and removing the bias caused by outliers in the training set. Assuming a Gaussian (input-dependent) ground truth distribution contaminated with a proportion $\varepsilon$ of outliers, we show that the fitted mean is in a $ce^{-1/\varepsilon}$ -neighborhood of the ground truth mean and the corrected variance is in a $b\varepsilon$ -neighborhood of the ground truth variance, whereas the uncorrected variance of the marginal distribution can even be infinite. We explicitly find $b$ as a function of the output of the GCP network, without a priori knowledge of the outliers (possibly input-dependent) distribution. Experiments with synthetic and real-world data sets indicate that the GCP network fitted with a standard optimizer outperforms other robust methods for regression.

1 Introduction

Development of methods robust against outliers in the observed data is an important direction of machine learning and statistics [17]. One distinguishes between input outliers (i.e., outliers $x$ in the input space) and output outliers (i.e., wrongly labeled samples $y$ ). The former can potentially be detected both during fitting neural networks and when one predicts labels of new data samples. The latter are visible at the fitting stage only and significantly distort the approximate distribution one uses for predictions afterwards. Bayesian neural networks and ensemble methods can naturally detect input outliers at the prediction stage by assigning high uncertainty to them [27, 20]. In order to deal with input outliers at the fitting stage, one can use a covariate shift importance sampling [31, 37], which assumes the knowledge of training and test distributions $p_{\rm train}(x)$ and $p_{\rm test}(x)$ of the input variable and downweights the samples with small ratios $p_{\rm test}(x)/p_{\rm train}(x)$ .

We concentrate on how to mitigate the influence of output outliers at the fitting stage. We will estimate the unknown mean and variance of labels ${\mathbf{y}}\sim p_{\rm g}(y|x)$ (ground truth distribution) in spite of contamination by an outliers distribution $p_{\rm o}(y|x)$. (Footnote 1: We denote random variables by bold letters and the arguments of their probability distributions by the corresponding non-bold letters.) More specifically, we assume that the labels in the training set have Huber’s contaminated distribution [16]

$$p_{\rm c}(y|x)=(1-\varepsilon)\,p_{\rm g}(y|x)+\varepsilon\,p_{\rm o}(y|x), \tag{1}$$

where $\varepsilon\in[0,1)$ represents the proportion of outliers. Henceforth, we omit conditioning on $x$ for notational ease. We assume throughout that the ground truth distribution $p_{\rm g}(y)$ is univariate Gaussian with mean $m_{\rm g}$ and variance $V_{\rm g}$ , and we denote by $m_{\rm o}$ and $V_{\rm o}$ the mean and variance of the outliers distribution $p_{\rm o}(y)$ . We do not impose restrictions on $p_{\rm o}(y)$ except for a certain polynomial decay at infinity, see technical assumptions in Sec. 3 and the supplement (Appendix B).
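As a quick illustration of why contamination matters, the mean and variance of the contaminated distribution can be written out explicitly. This is the standard computation for a two-component mixture, not a formula quoted from the paper, and the symbols $m_{\rm c}$ and $V_{\rm c}$ are introduced here only for this illustration:

$$m_{\rm c}=(1-\varepsilon)m_{\rm g}+\varepsilon m_{\rm o},\qquad V_{\rm c}=(1-\varepsilon)V_{\rm g}+\varepsilon V_{\rm o}+\varepsilon(1-\varepsilon)(m_{\rm o}-m_{\rm g})^{2}.$$

For example, with $m_{\rm o}=m_{\rm g}$, $V_{\rm g}=1$, $V_{\rm o}=100$ and $\varepsilon=0.05$, one gets $V_{\rm c}=0.95+5=5.95$, so a naive variance estimate would be about six times the ground truth variance even though only 5% of the labels are outliers.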

The main contributions of this paper are as follows.

1. We prove that outliers cause a qualitative change in the structure of the energy surfaces of the GCP network (analyzed in [11] in the absence of outliers). Namely, outliers make a global minimum bifurcate from infinity to a finite value (Theorem 3.1). In turn, this changes the predictive distribution from Gaussian to Student’s t, whose variance $V_{\rm St}$ may be significantly larger than the ground truth variance $V_{\rm g}$.
2. We show how the knowledge of the above finite equilibrium allows one to reconstruct the ground truth mean $m_{\rm g}$ and variance $V_{\rm g}$ (Theorems 4.1 and 4.2).

Our experiments in Sec. 5 with synthetic and real-world data sets indicate that the GCP network, fitted with a standard optimizer (Adam in our case), outperforms other robust methods, particularly by properly estimating the ground truth variance.

1.1 Main idea

For each $x$ in the input space, we define a probabilistic model for a random variable ${\mathbf{y}}$ and latent variables ${\boldsymbol{\mu}},{\boldsymbol{\tau}}$

$$p(y,\mu,\tau)=p(y|\mu,\tau)\,p(\mu,\tau|m,\nu,\alpha,\beta), \tag{2}$$

where the likelihood $p(y|\mu,\tau)$ is assumed Gaussian with mean $\mu$ and precision $\tau$ , while the latter are treated as random variables ${\boldsymbol{\mu}},{\boldsymbol{\tau}}$ with a normal-gamma distribution $p(\mu,\tau|m,\nu,\alpha,\beta)$ . The parameters $m,\nu,\alpha,\beta$ are functions of the input $x$ , and are represented as outputs of multi-layer neural networks (Fig. 1, left).

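The marginal likelihood is a Student’s t-distribution

$$t_{2\alpha}(y|m,\sigma/\alpha)=\int p(y,\mu,\tau)\,d\mu\,d\tau, \tag{3}$$

$$\sigma:=\frac{\beta(\nu+1)}{\nu}. \tag{4}$$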
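The marginalization in (3) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the standard normal-gamma parametrization (precision $\tau$ gamma-distributed with shape $\alpha$ and rate $\beta$, and $\mu$ given $\tau$ Gaussian with mean $m$ and precision $\nu\tau$), and the function names `student_t_marginal`, `sample_y` and `integrate` are hypothetical helpers, not part of the authors' code.

```python
import math
import random


def student_t_marginal(y, m, nu, alpha, beta):
    """Density t_{2*alpha}(y | m, sigma/alpha) with sigma = beta*(nu+1)/nu."""
    sigma = beta * (nu + 1.0) / nu
    z = y - m
    log_norm = (math.lgamma(alpha + 0.5) - math.lgamma(alpha)
                - 0.5 * math.log(2.0 * math.pi * sigma))
    return math.exp(log_norm - (alpha + 0.5) * math.log1p(z * z / (2.0 * sigma)))


def sample_y(m, nu, alpha, beta, rng):
    """Draw y hierarchically: precision tau, then mean mu, then the observation."""
    tau = rng.gammavariate(alpha, 1.0 / beta)      # shape alpha, scale 1/beta (rate beta)
    mu = rng.gauss(m, 1.0 / math.sqrt(nu * tau))   # mu | tau is Gaussian with precision nu*tau
    return rng.gauss(mu, 1.0 / math.sqrt(tau))     # y | mu, tau is Gaussian with precision tau


def integrate(f, lo, hi, n):
    """Composite trapezoid rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h
```

Under these assumptions the density integrates to one, and for $\alpha>1$ the sample variance of draws from `sample_y` should approach $\sigma/(\alpha-1)$, the variance of the marginal Student's t-distribution.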

In the standard Bayesian approach and $x$-independent case, one updates $m,\nu,\alpha,\beta$ based on observations $y_{1},\dots,y_{N}$. However, this is not possible in the neural networks framework because, on the one hand, different $y_{j}$ belong to different input points $x_{j}$ and, on the other hand, one cannot update the outputs $m,\nu,\alpha,\beta$ of a neural network directly. The theory of Bayesian neural networks suggests treating the weights of neural networks as random variables with a certain prior and approximating their (usually analytically intractable) posterior [25, 36, 3, 19, 8, 14, 23, 22]. Instead, we follow the gradient conjugate prior (GCP) method proposed in [11]. We treat the weights $w$ of the neural networks as deterministic parameters. Given an observation $y_{j}$ corresponding to an input $x_{j}$, one can explicitly find the parameters $m^{\prime},\nu^{\prime},\alpha^{\prime},\beta^{\prime}$ of the posterior distribution of ${\boldsymbol{\mu}},{\boldsymbol{\tau}}$. We perform a gradient descent step towards minimization of the Kullback–Leibler (KL) divergence from the posterior to the prior, where the gradient is taken with respect to the weights $w$ of the neural networks representing $m,\nu,\alpha,\beta$. It is shown in [11] that the GCP update is equivalent to maximizing the marginal log-likelihood $t_{2\alpha}(y|m,\sigma/\alpha)$. Furthermore, the above update of the weights $w$ induces an update of $m,\nu,\alpha,\beta$, which allows one to write a dynamical system (in the limit as the learning rate goes to $0$) for the evolution of $m,\nu,\alpha,\beta$ for each input $x$. This dynamical system takes the form

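$$\dot{m}=-{\mathbb E}\left[\frac{\partial K}{\partial m}\right],\quad \dot{\alpha}=-{\mathbb E}\left[\frac{\partial K}{\partial \alpha}\right],\quad \dot{\beta}=-{\mathbb E}\left[\frac{\partial K}{\partial \beta}\right],\quad \dot{\nu}=-{\mathbb E}\left[\frac{\partial K}{\partial \nu}\right], \tag{5}$$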

where $K=K(m,\nu,\alpha,\beta)$ is the above KL-divergence, $\dot{}=d/dt$ stands for the derivative with respect to fictitious time $t$ and the expectations are taken with respect to the contaminated distribution $p_{\rm c}(y)$ in (1), see details in Sec. 2.

By analyzing system (5), we show that, for small $\varepsilon>0$ , the parameters $m(t),\alpha(t),\sigma(t)$ (see (4)) converge to a finite equilibrium. We denote it by $m,\alpha,\sigma$ again (slightly abusing notation) and set

$$m_{\rm p}:=m,\quad V_{\rm p}:=\frac{\sigma}{\alpha-A(\alpha)}. \tag{6}$$

We call these quantities the prognostic mean and variance. (Footnote 2: As opposed to the predictive variance $V_{\rm St}$ in (9) of the marginal Student’s t-distribution $t_{2\alpha}(y|m,\sigma/\alpha)$ in (3).) Here $A(\alpha)$ is monotone increasing from $0$ to $1$ and satisfies $\alpha-A(\alpha)>0$ for all $\alpha>0$, see Fig. 4 in the supplement. It is defined as the unique root $A=A(\alpha)$ of the equation

$$\frac{2\alpha+1}{\sqrt{2\pi}}\int\frac{y^{2}}{2(\alpha-A)+y^{2}}\,e^{-y^{2}/2}\,dy-1=0. \tag{7}$$

Due to Lemma C.2 in the supplement, $A(\alpha)$ is well defined for all $\alpha>0$ . We show that, for small $\varepsilon$ , the prognostic mean $m_{\rm p}$ is exponentially close to $m_{\rm g}$ (Theorem 4.1), while the prognostic variance $V_{\rm p}$ is linearly close to $V_{\rm g}$ (Theorem 4.2), namely,

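$$m_{\rm g}=m_{\rm p}+O\bigl(e^{-c/\varepsilon}\bigr),\quad V_{\rm g}=(1-b\varepsilon)V_{\rm p}+O(\varepsilon^{2}), \tag{8}$$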

where $c>0$ and $b=b(\alpha)>0$ is defined in (56) in the supplement. We emphasize the novelty of the prognostic variance $V_{\rm p}$ in (6), which provides a correction of the commonly used variance of the marginal distribution (3). In our case, the latter is Student’s t variance

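$$V_{\rm St}:=\begin{cases}\dfrac{\sigma}{\alpha-1}, & \alpha>1,\\[1ex] \infty, & \alpha\leq 1.\end{cases} \tag{9}$$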

Note that $V_{\rm St}>V_{\rm p}$ ; moreover, $V_{\rm St}=\infty$ if $\alpha\leq 1$ . Therefore, even though Student’s t-distribution is a popular choice in robust statistics and indeed provides a robust estimate of the mean, it significantly overestimates the ground truth variance in the presence of outliers, yielding an error $O(1)$ . Our analysis allows us to recover the ground truth variance via (6) up to an error $O(\varepsilon)$ due to (8).
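The correction from $V_{\rm St}$ to $V_{\rm p}$ can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assumes the normalization of (7) is the standard Gaussian one (so that the integral is an expectation over a standard normal variable), solves (7) for $A(\alpha)$ by bisection on $(0,\alpha)$, and then evaluates (6) and (9). The helper names `gaussian_expectation`, `residual`, `solve_A`, `prognostic_variance` and `student_t_variance` are hypothetical.

```python
import math


def gaussian_expectation(g, half_width=10.0, n=4000):
    """E[g(y)] for y ~ N(0, 1), approximated by the composite trapezoid rule."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        y = -half_width + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * g(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


def residual(A, alpha):
    """Left-hand side of equation (7) for a trial value A < alpha."""
    c = 2.0 * (alpha - A)
    return (2.0 * alpha + 1.0) * gaussian_expectation(lambda y: y * y / (c + y * y)) - 1.0


def solve_A(alpha, iterations=50):
    """Bisection for the root A(alpha) of (7); the residual increases with A."""
    lo, hi = 0.0, alpha * (1.0 - 1e-9)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid, alpha) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def prognostic_variance(sigma, alpha):
    """V_p from (6)."""
    return sigma / (alpha - solve_A(alpha))


def student_t_variance(sigma, alpha):
    """V_St from (9); infinite when alpha <= 1."""
    return sigma / (alpha - 1.0) if alpha > 1.0 else math.inf
```

For instance, `prognostic_variance(1.0, 0.8)` is finite while `student_t_variance(1.0, 0.8)` is infinite, which is the situation in which the uncorrected variance is unusable.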

A practical algorithm for fitting GCP networks is given in the supplement (Appendix A).

1.2 Related work

There are several related approaches to mitigating the influence of output outliers. One popular approach is based on fitting heavy-tailed distributions, such as Student’s t [21, 24, 29]. Effectively, our GCP method also fits Student’s t-distribution, but additionally it reconstructs the ground truth variance $V_{\rm g}$ via (6) or (8). Localization of a probabilistic model [34] generalizes heavy-tailed distributions. Localization principle allows the likelihood of each sample to depend on its own copy of a latent variable, while all the copies obey the same probability distribution. In particular cases, marginalizing the latent variables gives rise to Student’s t marginal likelihood. Another body of methods uses data reweighting. One can manually assign binary weights to samples [17] or use the Bayesian framework [35], in which the likelihood of each sample is raised to a power being a latent variable. The posterior of these latent variables is inferred together with the posterior of other latent variables in the model. Another type of reweighting is provided by so-called robust divergences, which are used instead of the Kullback–Leibler divergence either in directly approximating the ground truth distribution or in learning the posterior distribution of the parameters. For example, the $q$ -entropy was used in [5], while $\beta$ - and $\gamma$ -divergences were studied in [1, 10, 7]. A number of papers develop robust gradient descent methods by detecting and reweighting the gradients of outliers during backpropagation [15, 28, 39] or by removing outliers from a fitted model followed by refitting [4]. We emphasize that our GCP approach, in contrast to the above methods, can be trained in one run with any standard optimizer (such as Adam, RMSprop, etc.), and it does not require fine tuning additional hyperparameters or explicitly estimating the contamination proportion $\varepsilon$ . 
On the other hand, knowing $\varepsilon$ , one can reduce the error for the variance estimation to $O(\varepsilon^{2})$ , see (8).

2 The GCP approach

We recall the GCP approach introduced in [11] and outlined in Sec. 1.1.

2.1 GCP update

We describe an update of $m,\nu,\alpha,\beta$ in (2) directly, assuming $x$ to be fixed. We refer to Remark 2.1 and to [11] for details concerning an update of the weights $w$ of neural networks representing $m,\nu,\alpha,\beta$ . Suppose we observe a new sample $y$ . Then, using the Bayes theorem, we find the conditional distribution of $({\boldsymbol{\mu}},{\boldsymbol{\tau}})$ under the condition that ${\mathbf{y}}=y$ . This posterior distribution denoted by $p_{\rm post}(\mu,\tau)$ is also normal-gamma [2], namely, $p_{\rm post}(\mu,\tau)=p(\mu,\tau|m^{\prime},\nu^{\prime},\alpha^{\prime},\beta^{\prime}),$ where the parameters are updated as follows:

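$$m^{\prime}=\frac{\nu m+y}{\nu+1},\quad \nu^{\prime}=\nu+1,\quad \alpha^{\prime}=\alpha+\frac{1}{2},\quad \beta^{\prime}=\beta+\frac{\nu}{\nu+1}\frac{(y-m)^{2}}{2}. \tag{10}$$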

However, in the framework of neural networks, one cannot update $m,\nu,\alpha,\beta$ directly. Instead, we fix $m^{\prime},\nu^{\prime},\alpha^{\prime},\beta^{\prime}$ according to (10) and use the KL divergence from $p_{\rm post}$ to $p$ , see [30]:

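$$\begin{aligned}K(m,\nu,\alpha,\beta):={}&\frac{\alpha^{\prime}(m-m^{\prime})^{2}\nu}{2\beta^{\prime}}+\frac{\nu}{2\nu^{\prime}}-\frac{1}{2}\ln\frac{\nu}{\nu^{\prime}}-\frac{1}{2}\\&-\alpha\ln\frac{\beta}{\beta^{\prime}}+\ln\frac{\Gamma(\alpha)}{\Gamma(\alpha^{\prime})}-(\alpha-\alpha^{\prime})\Psi(\alpha^{\prime})+\frac{\alpha^{\prime}(\beta-\beta^{\prime})}{\beta^{\prime}},\end{aligned} \tag{11}$$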

where $\Psi(\alpha):={\Gamma^{\prime}(\alpha)}/{\Gamma(\alpha)}$ is the digamma function and $\Gamma(\alpha)$ is the gamma function. After that, we update $m,\nu,\alpha,\beta$ by performing a gradient descent step in the direction $-\nabla K$ . Recalling that the observations are sampled from the contaminated distribution $p_{\rm c}(y)$ , we can approximate the fitting process by the dynamical system (5).
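The update (10) and the divergence (11) for a single observation can be sketched as follows. This is a plain-Python illustration with the four parameters treated as numbers rather than network outputs; the function names and the series-based `digamma` helper are assumptions made for the example (a library routine could be used instead).

```python
import math


def digamma(x):
    """Digamma function for x > 0 via the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def posterior_params(m, nu, alpha, beta, y):
    """Normal-gamma posterior parameters (10) after observing one sample y."""
    m_post = (nu * m + y) / (nu + 1.0)
    beta_post = beta + nu / (nu + 1.0) * (y - m) ** 2 / 2.0
    return m_post, nu + 1.0, alpha + 0.5, beta_post


def kl_posterior_to_prior(prior, post):
    """K in (11); prior and post are tuples (m, nu, alpha, beta), post is held fixed."""
    m, nu, alpha, beta = prior
    m_p, nu_p, alpha_p, beta_p = post
    return (alpha_p * (m - m_p) ** 2 * nu / (2.0 * beta_p)
            + nu / (2.0 * nu_p) - 0.5 * math.log(nu / nu_p) - 0.5
            - alpha * math.log(beta / beta_p)
            + math.lgamma(alpha) - math.lgamma(alpha_p)
            - (alpha - alpha_p) * digamma(alpha_p)
            + alpha_p * (beta - beta_p) / beta_p)
```

A useful sanity check is that, with the posterior held fixed, the derivative of `kl_posterior_to_prior` with respect to the prior mean equals $-(2\alpha+1)z/(2\sigma+z^{2})$ with $z=y-m$ and $\sigma=\beta(\nu+1)/\nu$, which is the negative gradient of the marginal log-likelihood with respect to $m$.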

Remark 2.1.

If $m,\nu,\alpha,\beta$ are parametrized by weights of neural networks, then the gradient of $K$ must be taken with respect to those weights, see the algorithm in the supplement (Appendix A). The dynamics of the weights will induce a dynamics of $m,\nu,\alpha,\beta$ with the right-hand sides that contain the gradients of $m,\nu,\alpha,\beta$ with respect to the weights [11]. However, they will enter as prefactors in (5). Hence any equilibrium of (5) will be an equilibrium of the dynamical system for the weights.

2.2 Explicit dynamical system

Dynamical system (5) can be explicitly written as follows (cf. (3.4)–(3.7) in [11]):

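$$\dot{m}=(2\alpha+1)F(m,\sigma,\varepsilon), \tag{12}$$

$$\dot{\alpha}=-G(m,\alpha,\sigma,\varepsilon), \tag{13}$$

$$\dot{\beta}=\frac{1}{\beta}H(m,\alpha,\sigma,\varepsilon),\quad \dot{\nu}=-\frac{1}{\nu(\nu+1)}H(m,\alpha,\sigma,\varepsilon), \tag{14}$$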

where

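$$F(m,\sigma,\varepsilon):=\int\frac{z}{2\sigma+z^{2}}\,p_{\rm c}(y)\,dy, \tag{15}$$

$$G(m,\alpha,\sigma,\varepsilon):=\int\ln\left(1+\frac{z^{2}}{2\sigma}\right)p_{\rm c}(y)\,dy+\Delta\Psi(\alpha), \tag{16}$$

$$H(m,\alpha,\sigma,\varepsilon):=\int\frac{\alpha z^{2}-\sigma}{2\sigma+z^{2}}\,p_{\rm c}(y)\,dy, \tag{17}$$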

the integrals are taken over ${\mathbb{R}}$ , $z=y-m$ , $p_{\rm c}(y)$ is defined in (1), and $\Delta\Psi(\alpha):=\Psi(\alpha)-\Psi(\alpha+1/2)$ . Equations (14) imply

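$$\dot{\sigma}=\left(\frac{(\nu+1)^{2}}{\nu^{2}\sigma}+\frac{\sigma}{(\nu+1)^{2}\nu^{2}}\right)H(m,\alpha,\sigma,\varepsilon). \tag{18}$$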
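For completeness, here is a short verification of (18), written out under the assumption that $H$ abbreviates the common factor in (14) and that $\sigma=\beta(\nu+1)/\nu$ as in (4). By the chain rule,

$$\dot{\sigma}=\frac{\nu+1}{\nu}\,\dot{\beta}-\frac{\beta}{\nu^{2}}\,\dot{\nu}=\left(\frac{\nu+1}{\nu\beta}+\frac{\beta}{\nu^{3}(\nu+1)}\right)H,$$

and substituting $\beta=\sigma\nu/(\nu+1)$ turns the two prefactors into $(\nu+1)^{2}/(\nu^{2}\sigma)$ and $\sigma/((\nu+1)^{2}\nu^{2})$. Both are positive for $\nu,\sigma>0$, so $\dot{\sigma}$ has the sign of $H$ and the equilibria in $\sigma$ are exactly the zeros of $H$.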

The first goal of this paper is to show that, given outliers ( $\varepsilon>0$ ), fitting the parameters by the GCP method automatically yields finite values of $\alpha$ and $\sigma$ . Theorem 3.1 shows that finite $\alpha$ and $\sigma$ occur via bifurcation at infinity as $\varepsilon$ becomes nonzero. The second goal is to show that the obtained prognostic mean $m_{\rm p}$ and variance $V_{\rm p}$ in (6) do approximate the ground truth mean and variance in the sense of (8) given the output $m,\alpha,\beta,\nu$ of a fitted GCP network. This is done in Theorems 4.1 and 4.2.
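The right-hand sides $F$, $G$, $H$ of the dynamical system can be evaluated by simple quadrature. The sketch below is illustrative only: it assumes a Gaussian outlier component (the analysis itself allows more general $p_{\rm o}$), uses a trapezoid rule on a fixed interval, and all function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def contaminated_pdf(y, eps, m_g, V_g, m_o, V_o):
    """Contaminated density (1); the Gaussian outlier component is an assumption."""
    return (1.0 - eps) * normal_pdf(y, m_g, V_g) + eps * normal_pdf(y, m_o, V_o)


def expectation(g, pdf, lo=-80.0, hi=80.0, n=8000):
    """Trapezoid approximation of the integral of g(y) * pdf(y) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        total += (0.5 if i in (0, n) else 1.0) * g(y) * pdf(y)
    return total * h


def rhs_FGH(m, alpha, sigma, pdf):
    """Values of F, G, H at (m, alpha, sigma), with z = y - m."""
    F = expectation(lambda y: (y - m) / (2.0 * sigma + (y - m) ** 2), pdf)
    G = (expectation(lambda y: math.log1p((y - m) ** 2 / (2.0 * sigma)), pdf)
         + digamma(alpha) - digamma(alpha + 0.5))
    H = expectation(lambda y: (alpha * (y - m) ** 2 - sigma) / (2.0 * sigma + (y - m) ** 2), pdf)
    return F, G, H
```

With outliers placed symmetrically around $m_{\rm g}$, `F` vanishes at $m=m_{\rm g}$ and is positive for $m<m_{\rm g}$, so the mean is pushed back towards the ground truth; `H` changes sign between small and large $\sigma$, which is what makes a finite equilibrium in $\sigma$ possible.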

3 Bifurcation of predictive distribution from Gaussian to Student’s t

In this section, we show that an arbitrarily small percentage of outliers qualitatively changes the dynamics of (12)–(14), (18), namely, it makes $\alpha$ and $\sigma$ converge to finite values. This changes the predictive distribution from Gaussian to Student’s t. We will prove that this happens via bifurcation of the equilibrium $\alpha,\sigma$ at infinity (Fig. 1, right). In the next section, we show that the correction given by the prognostic variance $V_{\rm p}$ in (6) is $\varepsilon$-close to the ground truth variance $V_{\rm g}$.

Denote by $\mu_{\rm o}^{(k)}$ the $k$ th central moment of $p_{\rm o}(y)$ . The following technical assumption requires that the mean $m_{\rm o}$ or the variance $V_{\rm o}$ of outliers be large enough, or $p_{\rm o}(y)$ have heavy tails. It is used only in this section and does not depend on $\varepsilon$ .

Condition 3.1.

The outliers distribution $p_{\rm o}(y)$ satisfies $C_{\rm go}>0$ , where

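$$C_{\rm go}:=(m_{\rm o}-m_{\rm g})^{4}+6(V_{\rm o}-V_{\rm g})(m_{\rm o}-m_{\rm g})^{2}+3(V_{\rm o}-V_{\rm g})^{2}+(\mu_{\rm o}^{(4)}-3V_{\rm o}^{2})+4\mu_{\rm o}^{(3)}(m_{\rm o}-m_{\rm g}). \tag{19}$$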

We will see that $C_{\rm go}$ plays the role of an indicator of outliers. The larger $C_{\rm go}$ is compared with $V_{\rm g}$, the better the GCP method recognizes samples from $p_{\rm o}(y)$ as outliers and the better it filters them out. A similar role of an indicator will be played by the absolute value of the constant

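$$D_{\rm go}:=(m_{\rm o}-m_{\rm g})^{3}+3(V_{\rm o}-V_{\rm g})(m_{\rm o}-m_{\rm g})+\mu_{\rm o}^{(3)}. \tag{20}$$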

Theorem 3.1.

Let Condition 3.1 hold with some $m_{\rm g},V_{\rm g},m_{\rm o},V_{\rm o}$. Then, for all sufficiently small $\varepsilon>0$, there exists a unique equilibrium $m_{\varepsilon},\alpha_{\varepsilon},\sigma_{\varepsilon}$ of system (12), (13), (18). The following asymptotics hold as $\varepsilon\to 0$:

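$$m_{\varepsilon}=m_{\rm g}+(m_{\rm o}-m_{\rm g})\varepsilon-\frac{C_{\rm go}D_{\rm go}}{6V_{\rm g}^{3}}\varepsilon^{2}+O(\varepsilon^{3}), \tag{21}$$

$$\alpha_{\varepsilon}=\frac{3V_{\rm g}^{2}}{C_{\rm go}}\varepsilon^{-1}+O(\varepsilon^{-2}),\quad \sigma_{\varepsilon}=\frac{3V_{\rm g}^{3}}{C_{\rm go}}\varepsilon^{-1}+O(\varepsilon^{-2}). \tag{22}$$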

Theorem 3.1 is proved in the supplement.
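To make Condition 3.1 concrete, the constants $C_{\rm go}$ and $D_{\rm go}$ can be evaluated directly from the moments of the outliers distribution. The sketch below is an illustration with hypothetical function names; when the fourth central moment is not supplied it assumes the Gaussian value $3V_{\rm o}^{2}$, and `leading_order_equilibrium` keeps only the leading terms of the asymptotics for $\alpha_{\varepsilon}$ and $\sigma_{\varepsilon}$ stated in Theorem 3.1.

```python
def outlier_indicators(m_g, V_g, m_o, V_o, mu3_o=0.0, mu4_o=None):
    """Constants C_go and D_go from the means, variances and central moments."""
    if mu4_o is None:
        mu4_o = 3.0 * V_o ** 2          # Gaussian fourth central moment (assumption)
    dm = m_o - m_g
    dV = V_o - V_g
    C_go = (dm ** 4 + 6.0 * dV * dm ** 2 + 3.0 * dV ** 2
            + (mu4_o - 3.0 * V_o ** 2) + 4.0 * mu3_o * dm)
    D_go = dm ** 3 + 3.0 * dV * dm + mu3_o
    return C_go, D_go


def leading_order_equilibrium(eps, V_g, C_go):
    """Leading terms of alpha_eps and sigma_eps; remainders are ignored."""
    alpha = 3.0 * V_g ** 2 / (C_go * eps)
    sigma = 3.0 * V_g ** 3 / (C_go * eps)
    return alpha, sigma
```

For Gaussian outliers with $m_{\rm g}=0$, $V_{\rm g}=1$, $m_{\rm o}=2$, $V_{\rm o}=4$ this gives $C_{\rm go}=115$ and $D_{\rm go}=26$, whereas outliers that are indistinguishable from the ground truth ($m_{\rm o}=m_{\rm g}$, $V_{\rm o}=V_{\rm g}$, Gaussian) give $C_{\rm go}=0$ and violate Condition 3.1. Note also that the leading-order ratio $\sigma_{\varepsilon}/\alpha_{\varepsilon}$ equals $V_{\rm g}$.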

4 Prognostic mean and variance

The main practical question we answer in this section is the following. Given a finite equilibrium $(m,\alpha,\sigma)$ (as observed after the model is fitted), what can we tell about the ground truth mean $m_{\rm g}$ and variance $V_{\rm g}$ ? Due to (6), the equilibrium $(m,\alpha,\sigma)$ uniquely determines prognostic mean $m_{\rm p}$ and variance $V_{\rm p}$ . Thus, for each $\varepsilon$ , there remain 4 unknowns $m_{\rm o},V_{\rm o},m_{\rm g},V_{\rm g}$ in the 3 equations $F=G=H=0$ (see (28)–(30)). In this section, we assume they are functions of $\varepsilon$ and obtain their asymptotics for small $\varepsilon$ under the following condition.

Condition 4.1.

Either $m_{\rm o}(\varepsilon)$ or $V_{\rm o}(\varepsilon)$ is constant in $\varepsilon$ .

The next theorem shows that the prognostic mean $m_{\rm p}$ is exponentially close to $m_{\rm g}$ .

Theorem 4.1.

Let $p_{\rm o}(y):=\frac{1}{\sqrt{V_{\rm o}}}\tilde{p}_{\rm o}\left(\frac{y-m_{\rm o}}{\sqrt{V_{\rm o}}}\right)$ , where $\tilde{p}_{\rm o}(y)$ is an arbitrary distribution with zero mean and unit variance. Let $(m,\alpha,\sigma)$ be an equilibrium (independent of $\varepsilon$ ) for system (12), (13), (18). Let $V_{\rm g}(\varepsilon)$ be bounded for all small $\varepsilon$ . Then there is an equilibrium $m_{\rm g}(\varepsilon)$ of Eq. (12) such that

$$m_{\rm g}=m_{\rm p}+O\big(e^{-c/\varepsilon}\big)\quad\text{as }\varepsilon\to 0 \tag{23}$$

for some $c>0$ that does not depend on $\varepsilon$ and $m_{\rm g}$ .

The proof is given in the supplement (Appendix C). (Footnote 3: Theorem 4.1 is proved under the assumption that either $|m_{\rm o}(\varepsilon)|$ or $V_{\rm o}(\varepsilon)$ is bounded for small $\varepsilon$, which is weaker than Condition 4.1.)

Next, we analyze how much the prognostic variance $V_{\rm p}$ in (6) differs from the ground truth variance $V_{\rm g}$ . Theorem 4.1 shows that the equilibrium $m=m_{\rm p}$ of (12) is exponentially close to $m_{\rm g}$ . Therefore, to simplify our next statement and the technicalities of its proof, we assume that $m=m_{\rm g}$ .

Theorem 4.2.

Let $p_{\rm o}(y):=\frac{1}{\sqrt{V_{\rm o}}}\tilde{p}_{\rm o}\left(\frac{y-m_{\rm o}}{\sqrt{V_{\rm o}}}\right)$ , where $\tilde{p}_{\rm o}(y)$ is an arbitrary distribution with zero mean and unit variance. Let $(\alpha,\sigma)$ be an equilibrium (independent of $\varepsilon$ ) for system (13), (18) with $m=m_{\rm g}$ . Then

$$V_{\rm g}=(1-b\varepsilon)V_{\rm p}+O(\varepsilon^{2}), \tag{24}$$

where $b=b(\alpha)>0$ is defined in (56) in the supplement.

The proof is given in the supplement (Appendix D). Moreover, we prove therein that any finite $(\alpha,\sigma)$ is realizable as an equilibrium for some $m_{\rm g},V_{\rm g},m_{\rm o},V_{\rm o},\varepsilon$ .

Asymptotics (24) should be compared with Student’s t variance $V_{\rm St}$ in (9), which yields an error of order $1$ if $\alpha>1$ and an infinite error if $\alpha\leq 1$ .

5 Experiments

5.1 Methods

We compare the following robust methods:

1. Beta and Gamma: the methods in which one minimizes, respectively, the $\beta$- and $\gamma$-divergences from the ground truth to the approximating normal distribution [1, 10, 6];
2. BetaBayes: the robust Bayesian method based on the $\beta$-divergence [7];
3. GCPSt: the GCP with the Student’s t variance $V_{\rm St}$, https://github.com/hstuk/GCP;
4. GCP: the GCP with the prognostic variance $V_{\rm p}$, https://github.com/hstuk/GCP;
5. EnsBeta, EnsGamma, EnsGCP: ensembles of 5 Beta, Gamma, and GCP respectively.

(Footnote 4: Our preliminary results with robust gradient descent in [15] and [39] were significantly worse than those obtained by the other methods, especially in case of input-dependent variance in the loss. Therefore we do not include them in Table 1. We did not implement the robust gradient estimation in [28] because fine tuning its hyperparameters requires the knowledge of $\varepsilon$, see Sec. 3.3 therein.)

(Footnote 5: We performed a grid search for $\beta$ and the (input-independent) standard deviation of the likelihood. By varying these two parameters, one obtains the same set of loss functions as by varying $\gamma$ and the standard deviation in the robust Bayesian method in [7] based on the $\gamma$-divergence. Therefore, we do not include the latter method as a separate one in our comparison list.)

Note that the Beta, Gamma and the GCP-based methods estimate aleatoric uncertainty since they learn the variance of labels conditioned on the input $x$ , while the Bayesian method BetaBayes estimates epistemic uncertainty since the variance of the likelihood is treated as a hyperparameter, while the predictive variance is $x$ -dependent only due to randomness in the weights. The ensemble methods are supposed to learn both aleatoric and epistemic uncertainty, and their overall variance is computed as the variance of the Gaussian mixture distribution, cf. [20]. Architectures and hyperparameters for all methods are given in the supplement.

5.2 Synthetic data set

We generate a synthetic data set containing 5% of outliers. To do so, we choose the set $X$ consisting of 400 points uniformly distributed on the interval $(-1,1)$. For each $x\in X$, with probability 0.95 we sample $y$ from the normal distribution with mean $\sin(3x)$ and standard deviation $0.5\cos^{4}x$, and with probability 0.05 we sample $y$ from a uniform distribution on the interval $(-4,16)$. Figure 2 shows the data and the fits for different methods. Even though the means are accurately predicted by most robust methods, the GCP learns the variance best. Furthermore, the output $\alpha$ of the GCP network provides additional information, namely, small values of $\alpha$ indicate that the corresponding samples belong to a (less trustworthy) region in which the training set contained outliers.
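The sampling procedure described above can be sketched in a few lines. This is an illustrative reimplementation with the standard library, not the authors' script; the function name `make_synthetic`, the seed handling, and the returned outlier flag are assumptions added for the example.

```python
import math
import random


def make_synthetic(n=400, outlier_fraction=0.05, seed=0):
    """Return a list of (x, y, is_outlier) triples for the synthetic experiment."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        if rng.random() < outlier_fraction:
            # Outlier: uniform on (-4, 16), independent of x.
            y, is_outlier = rng.uniform(-4.0, 16.0), True
        else:
            # Ground truth: Gaussian with mean sin(3x) and std 0.5*cos(x)^4.
            y = rng.gauss(math.sin(3.0 * x), 0.5 * math.cos(x) ** 4)
            is_outlier = False
        data.append((x, y, is_outlier))
    return data
```

Keeping the flag makes it possible to inspect afterwards whether regions with small fitted $\alpha$ coincide with regions that actually contained outliers.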

5.3 Real world data sets

Data sets. We analyze the following publicly available data sets: Boston House Prices [13] ($506$ samples, 13 features), Concrete Compressive Strength [38] ($1030$ samples, 8 features), Combined Cycle Power Plant [33, 18] ($9568$ samples, 4 features), Yacht Hydrodynamics [9, 26] (308 samples, 6 features), and Kinematics of an 8 Link Robot Arm Kin8nm (8192 samples, 8 features; footnote 6: http://mldata.org/repository/data/viewslug/regression-datasets-kin8nm/). Each data set is randomly split into train-test subsets with 95% of samples in the training subset. For each training set, we randomly choose $\lambda$% of samples and replace them by outliers. The outliers are sampled from the Gaussian distribution with the mean equal to the mean over all the targets in the original training set and standard deviation equal to ten times the standard deviation over the targets in the original training set. All the results reported below are the respective averages over 50 cross-validations.
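The contamination step can be sketched as below. The sketch assumes that only the targets of the chosen samples are replaced (the paper deals with output outliers) and that the count is rounded to the nearest integer; the function name `contaminate_targets` is hypothetical.

```python
import numpy as np

def contaminate_targets(y_train, percent, seed=0):
    """Replace `percent`% of the training targets by Gaussian outliers.

    Outliers have the mean of the original targets and ten times their
    standard deviation. Rounding of the outlier count is an assumption.
    """
    rng = np.random.default_rng(seed)
    y_train = np.asarray(y_train, dtype=float)
    n_out = int(round(len(y_train) * percent / 100.0))
    # Choose distinct positions to overwrite.
    idx = rng.choice(len(y_train), size=n_out, replace=False)
    y_new = y_train.copy()
    y_new[idx] = rng.normal(y_train.mean(), 10.0 * y_train.std(), size=n_out)
    return y_new, idx
```

The statistics are taken from the original (noncontaminated) targets, as stated in the text, and the input array is left unmodified.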

Measures. We use two measures of the quality of the fit.

The overall root mean squared error (RMSE).
The area under the following curve (AUC), measuring the trade-off between properly learning the mean and the variance. Assume the test set contains $N$ samples. We order them with respect to their predicted variance. For each $n=0,\dots,N-1$ , we remove $n$ samples with the highest variance and calculate the RMSE for the remaining $N-n$ samples (with the lowest variance). We denote it by ${\rm RMSE}(n)$ and plot it versus $n$ as a continuous piecewise linear curve. The second measure is the area under this curve normalized by $N-1$ : ${\rm AUC}:=\frac{1}{N-1}\sum\limits_{n=0}^{N-2}\frac{{\rm RMSE}(n)+{\rm RMSE}(n+1)}{2}.$
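The second measure can be computed directly from its definition. The following is a minimal sketch; the function names `rmse_curve` and `auc_score` are illustrative, and ties in the predicted variance are broken by the original sample order, which is an assumption.

```python
import numpy as np

def rmse_curve(y_true, y_pred, var_pred):
    """Return RMSE(n) for n = 0, ..., N-1, where the n samples with the
    highest predicted variance have been removed."""
    y_true, y_pred, var_pred = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, var_pred)
    )
    order = np.argsort(var_pred, kind="stable")  # lowest variance first
    sq_err = (y_true[order] - y_pred[order]) ** 2
    # Mean squared error over the k lowest-variance samples, k = 1, ..., N.
    cum_mse = np.cumsum(sq_err) / np.arange(1, len(sq_err) + 1)
    # RMSE(n) keeps N - n samples, so the curve is the reversed sequence.
    return np.sqrt(cum_mse[::-1])

def auc_score(y_true, y_pred, var_pred):
    """Trapezoidal area under the RMSE(n) curve, normalized by N - 1."""
    curve = rmse_curve(y_true, y_pred, var_pred)
    return np.sum((curve[:-1] + curve[1:]) / 2.0) / (len(curve) - 1)
```

A method whose predicted variance is high exactly where its errors are large obtains a steeply decreasing curve and hence a small AUC. The sketch requires at least two test samples.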

Results. Table 1 presents RMSE and AUC scores for the outlier percentages $\lambda=0,5,10,15,20$. (Footnote 7: The symbol $*$ indicates that we were not able to fine-tune the parameters of BetaBayes to obtain reasonable predictions for the Power and Kin8nm data sets. Note that the authors of [7] used a protocol for fitting BetaBayes different from ours: they first normalized the noncontaminated training set and then added outliers.) In each column, we mark a method in bold if it is significantly better than, or indistinguishable from, all the other methods (according to the two-tailed paired difference test with $p=0.05$). We see that the GCP significantly improves the AUC scores of the GCPSt in the presence of outliers. Furthermore, EnsGCP yields the best AUC among all the methods for all $\lambda$, see also Fig. 3. Thus, it provides the best trade-off between properly learning the mean and the variance. Its RMSE score is competitive with or superior to the other methods. Moreover, after removing a small number of samples for which EnsGCP predicts a high variance, its RMSE for the remaining samples becomes significantly better than the respective RMSE of the other methods, see the curves ${\rm RMSE}(n)$ in Fig. 5 in the supplement.

6 Conclusion

We analyzed the minima of the energy surfaces of the GCP networks encoding the priors of latent variable models. Under the assumption of Huber’s $\varepsilon$ -contamination of the Gaussian ground truth distribution $p_{\rm g}(y|x)$ , we obtained formulas for prognostic mean $m_{\rm p}(x)$ and variance $V_{\rm p}(x)$ in terms of the outputs of the GCP networks, yielding errors for the ground truth mean $m_{\rm g}(x)$ and variance $V_{\rm g}(x)$ of order $O(e^{-c/\varepsilon})$ and $O(\varepsilon)$ respectively.

The GCP networks can be trained with standard optimizers (such as Adam, RMSProp, etc.) and do not require fine-tuning of additional hyperparameters. Experiments with synthetic and real world data with outliers showed their superiority over several other state-of-the-art robust methods based on neural networks.

Appendix A Algorithm for fitting a GCP network and predicting the mean and variance of the ground truth distribution

In this section, we present a practical algorithm for defining a loss of a GCP network, fitting it, and predicting the mean and variance of the ground truth distribution in a robust way. The code is available at https://github.com/hstuk/GCP.

Given an input $x\in\mathbb{R}^{d}$ and a vector of weights $w$ , we denote the $4$ -dimensional output of the GCP network by $m(w,x),\alpha(w,x),\beta(w,x),\nu(w,x)$ . The outputs can share the weights or have independent weights, in which case $w=(w_{m},w_{\alpha},w_{\beta},w_{\nu})$ . For each labeled sample $(x,y)$ with $x\in\mathbb{R}^{d}$ , $y\in\mathbb{R}$ , we define a loss $L(w,x,y)$ according to Algorithm 1 or Algorithm 2. According to [11, Lemma 2.1], these two algorithms yield the same loss up to an additive constant not depending on $w$ .

Given the loss $L(w,x,y)$ defined in Algorithm 1 or 2 and a training set $(X,Y)$ , we fit the GCP network by minimizing

[TABLE]

using any standard optimizer (e.g., Adam, RMSProp, etc.). Once the GCP network is fitted, we predict the mean and variance of the ground truth distribution $p_{\rm g}(y|x)$ as follows (see Eq. (6)):

[TABLE]

where $A(\alpha)$ is defined as the unique root of Eq. (7). The function $A(\alpha)$ can be precomputed or, due to [11], approximated by

[TABLE]

see Fig. 4.

Remark A.1.

The fitted GCP network minimizes the negative log-likelihood of Student’s t-distribution $p(y|x,m(w,x),\tilde{\nu}(w,x),\tilde{\sigma}(w,x))$, see Algorithm 2. One can rewrite the above prognostic variance $V_{\rm p}(x)$ in terms of $\tilde{\nu}(w,x),\tilde{\sigma}(w,x)$, namely

[TABLE]

This approach would reduce the $4$-dimensional output of the GCP network to a $3$-dimensional output directly encoding the parameters of Student’s t distribution. However, the resulting dynamics of the weights $w$ and the induced dynamics of $m(w,x),\tilde{\nu}(w,x),\tilde{\sigma}(w,x)$ (a counterpart of dynamical system (12)–(14)) remain an open question and a direction for future research.
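To illustrate the kind of per-sample loss the remark refers to, here is a sketch of the negative log-likelihood of a location-scale Student's t-distribution. It assumes the standard parametrization in which `nu` is the number of degrees of freedom and `scale` is the scale parameter; how these relate to $\tilde{\nu},\tilde{\sigma}$ and to the network outputs in Algorithm 2 is not reproduced here, so the mapping is left to the reader. The function name `student_t_nll` is hypothetical.

```python
import math

def student_t_nll(y, m, nu, scale):
    """Negative log-likelihood of y under a Student's t-distribution with
    location m, degrees of freedom nu > 0, and scale > 0 (standard
    location-scale parametrization, assumed for this sketch)."""
    z = (y - m) / scale
    return (
        math.lgamma(nu / 2.0)
        - math.lgamma((nu + 1.0) / 2.0)
        + 0.5 * math.log(nu * math.pi)
        + math.log(scale)
        + 0.5 * (nu + 1.0) * math.log1p(z * z / nu)
    )
```

The last term grows only logarithmically in the residual, in contrast to the quadratic growth of a Gaussian negative log-likelihood, which is why a single distant label has a bounded influence on the gradient.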

Appendix B Proof of Theorem $3.1$

We assume throughout the proof that $p_{\rm o}(y)$ is continuously differentiable, its sixth central moment exists, and there is $C>0$ such that

[TABLE]

and

[TABLE]

for all $M\geq 1$ .

Without loss of generality, assume that

[TABLE]

First, we show that system (12)–(14) has at least one equilibrium $m_{\varepsilon},\alpha_{\varepsilon},\beta_{\varepsilon},\nu_{\varepsilon}$ . To do so, it suffices to prove that the system of equations

[TABLE]

(where the integrals are taken over ${\mathbb{R}}$, $z=y-m$, $p_{\rm c}(y)$ is defined in (1), and $\Delta\Psi(\alpha):=\Psi(\alpha)-\Psi(\alpha+1/2)$) has a root $m_{\varepsilon},\alpha_{\varepsilon},\sigma_{\varepsilon}$.

First, we solve Eq. (30) with respect to $\alpha$ . Setting $\delta=1/(2\sigma)$ , we have for $z^{2}<2\sigma$ :

[TABLE]

Hence, additionally using the decay of $p_{\rm c}(y)$ at infinity to estimate the integral for $z^{2}>2\sigma$ , we obtain

[TABLE]

Here

[TABLE]

the functions $f_{j}(m,\delta,\varepsilon)$ for $j=0,1,2$ (and $j=3,4$ below) are smooth for $\varepsilon\in[0,1]$ and $m,\delta$ in a neighborhood of the origin, and their partial derivatives with respect to $m$ and $\varepsilon$ are $O(\delta^{4})$ as $\delta\to 0$ uniformly for $\varepsilon\in[0,1]$ and $m$ in a neighborhood of the origin.

Solving $H(m,\alpha,\sigma,\varepsilon)=0$ for $\alpha$ yields

[TABLE]

where, for brevity, we omitted the dependence of the functions on their arguments.

Using the Taylor formula for the logarithm and the asymptotic expansion of $\Psi(\alpha)$ , we have

[TABLE]

Plugging in $\alpha$ given by (31) into (32) and dividing by $\delta^{2}$ , we see that, for $\delta>0$ , system (28)–(30) is equivalent to

[TABLE]

We solve system (33)–(35) with respect to $m,\delta$ , using the implicit function theorem. Note that $F_{1}(0,0,0)=\int yp_{\rm g}(y)\,dy=0$ due to (27) and $F_{2}(0,0,0)=K(0,0,0)-3V^{2}(0,0,0)=0$ since $V(0,0,0)$ and $K(0,0,0)$ are the second and the fourth central moments of the Gaussian distribution $p_{\rm g}(y)$ . At $(m,\delta,\varepsilon)=(0,0,0)$ , we have

[TABLE]

The vector of $\varepsilon$ -derivatives at $(0,0,0)$ is

[TABLE]

Hence, by the implicit function theorem, there exist $m_{1},\delta_{1},\varepsilon_{1}>0$ such that for any $\varepsilon\in[0,\varepsilon_{1}]$ , system (33), (35) has a unique root $m_{\varepsilon},\delta_{\varepsilon}$ in the set

[TABLE]

Moreover, $m_{\varepsilon},\delta_{\varepsilon}$ are smooth functions of $\varepsilon$ and

[TABLE]

In particular, (37) shows that $\delta_{\varepsilon}>0$ and hence $\sigma_{\varepsilon}>0$ . Combining (37) with (31) proves asymptotics (22). To prove asymptotics (21), we substitute $m_{\varepsilon}=m_{\rm o}\varepsilon+M\varepsilon^{2}+O(\varepsilon^{3})$ and $\delta_{\varepsilon}=\frac{C_{\rm go}}{6V_{\rm g}^{3}}\varepsilon+O(\varepsilon^{2})$ into (33). This yields

[TABLE]

where $\omega_{\rm o}^{(3)}$ is the third moment about $0$ for the outlier distribution $p_{\rm o}(y)$. Rewriting $\omega_{\rm o}^{(3)}$ via the central moments, we see that the constant $M$ equals the coefficient at $\varepsilon^{2}$ in (21).

It remains to show that system (12)–(14) has no other equilibrium except for that found in part 1 of the proof. Assume, to the contrary, that there is a sequence $\varepsilon_{n}\to 0$ and the respective sequence of solutions $(m_{n},\alpha_{n},\sigma_{n})$ of system (28)–(30) that is different for each $\varepsilon_{n}$ from those in part 1 of the proof.

First, we show that there exists $\tilde{m}$ (independent of $\sigma>0$ and $\varepsilon\in[0,1]$) such that $|m_{n}|\leq\tilde{m}$. Assume this is not true. First consider the case where $\sigma_{n}$ is bounded. Let $m_{n}\to-\infty$ (the case $m_{n}\to\infty$ is treated similarly). We rewrite Eq. (28) as follows:

[TABLE]

where

[TABLE]

Using (26) and (25), we have

[TABLE]

where $C_{1},C_{2}>0$ do not depend on $n$ . Further, we choose $M>0$ such that

[TABLE]

for all $n>0$ . Then, using the assumption that $\sigma_{n}$ is bounded, we have for all sufficiently large $n$

[TABLE]

where $C_{3}>0$ does not depend on $n$ . Relations (39)–(41) contradict (38).

Consider the case $\sigma_{n}\to\infty$ . Then $\delta_{n}:=1/(2\sigma_{n})\to 0$ , and we rewrite Eq. (28) as follows:

[TABLE]

Then

[TABLE]

where $g_{1}(\delta,\varepsilon,m),g_{2}(\delta,\varepsilon,m)\to 0$ as $(\delta,\varepsilon)\to 0$ uniformly with respect to $m\in{\mathbb{R}}$ . This again contradicts the assumption $m_{n}\to\infty$ . Thus, any root of Eq. (28) indeed satisfies $|m|\leq\tilde{m}$ .

Further, we show that $\sigma_{n}$ is bounded away from $0$. Assume, to the contrary, that (possibly after passing to a subsequence) $\sigma_{n}\to 0$. Then, due to (30), $\alpha_{n}\to 0$. Expressing $\alpha$ via $\sigma$ in (30) and using the fact that $m_{n}$ is bounded, we immediately see that $\alpha_{n}\leq c_{1}\sqrt{\sigma_{n}}$ for all sufficiently large $n$, where $c_{1}>0$ does not depend on $n$. On the other hand, (29) is equivalent to

[TABLE]

Since $m_{n}$ is bounded, the latter equality yields $\alpha_{n}\geq c_{2}/\ln(\sigma^{-1})$ for all sufficiently large $n$ , where $c_{2}>0$ does not depend on $n$ . This contradicts the first inequality for $\alpha_{n}$ .

Due to part 2, we can assume (possibly after passing to a subsequence) that $m_{n}\to\tilde{m}$ for some $\tilde{m}$. If $\sigma_{n}$ is bounded, then (possibly after passing to a subsequence) $\sigma_{n}\to\tilde{\sigma}$ and $\tilde{\sigma}>0$ due to part 3. Then by Theorem 3.1 in [11], $\tilde{m}=m_{\rm g}=0$. Furthermore, since $\sigma_{n}\to\tilde{\sigma}>0$, it follows from (30) that $\alpha_{n}\to\tilde{\alpha}>0$. Thus, $\tilde{\sigma},\tilde{\alpha}>0$ solve the equations (28), (30) with $m=0$. However, by Theorem 3.2, item (c) in [11], the system of these two equations has no solution for $\sigma,\alpha>0$. Therefore, $\sigma_{n},\alpha_{n}\to\infty$, and for sufficiently large $n$, they enter a region where, by part 1, the solution $(\varepsilon_{n},m_{n},\alpha_{n},\sigma_{n})$ is unique.

Appendix C Proof of Theorem $4.1$

For the proof of Theorem 4.1, we need two auxiliary results, which are given in the next two subsections.

C.1 Prognostic mean for any fixed $\varepsilon$

In this subsection, we assume that $\varepsilon$ is fixed and is not necessarily small, and analyze how the equilibrium $m=m_{\rm p}$ of Eq. (12) gets perturbed compared with the ground truth mean $m_{\rm g}$, provided that $m_{\rm o}$ or $V_{\rm o}$ is large. We will see that the larger the values of $|m_{\rm o}-m_{\rm g}|$ or $V_{\rm o}$ are, the better the samples from $p_{\rm o}(y)$ are recognized as outliers and the stronger $m_{\rm p}$ gets shifted towards $m_{\rm g}$.

Lemma C.1.

Let $p_{\rm o}(y):=\frac{1}{\sqrt{V_{\rm o}}}\tilde{p}_{\rm o}\left(\frac{y-m_{\rm o}}{\sqrt{V_{\rm o}}}\right)$ , where $\tilde{p}_{\rm o}(y)$ is an arbitrary distribution with zero mean and unit variance. We fix $\varepsilon_{*}\in[0,1)$ and $\alpha,\sigma>0$ . Then the following hold for all $\varepsilon\in[0,\varepsilon_{*}]$ .

1. If $|m_{\rm o}-m_{\rm g}|$ is large enough, then Eq. (12) has an equilibrium $m_{\rm p}$ in a neighborhood of $m_{\rm g}$ satisfying

[TABLE]

as $|m_{\rm o}-m_{\rm g}|\to\infty,$ where

[TABLE]

2. If $V_{\rm o}$ is large enough, then Eq. (12) has an equilibrium $m_{\rm p}$ in a neighborhood of $m_{\rm g}$ satisfying, for any $\varkappa>0$,

[TABLE]

In both cases, $O(\cdot)$ is uniform with respect to $\varepsilon\in[0,\varepsilon_{*}]$ , $m_{\rm g}\in{\mathbb{R}}$ , and $V_{\rm g}$ from bounded intervals.

Proof.

Without loss of generality, assume that $m_{\rm g}=0$ .

Proof of item 1.

We set $\lambda=1/m_{\rm o}$ and apply the implicit function theorem to

[TABLE]

We have $f(0,0)=0$ . Integrating by parts yields

[TABLE]

where $c_{1}$ is defined in (43). Further, $\partial_{\lambda}f(0,0)=\varepsilon$ . Hence, there is a neighborhood of $(0,0)$ in which Eq. (45) has a unique root $m_{\rm p}=m_{\rm p}(\lambda)$ for each fixed $\lambda$ , and

[TABLE]

Finally, one can check that the second derivatives of $f(m,\lambda)$ are continuous in a neighborhood of $(0,0)$ , which implies the Taylor expansion of $m_{\rm p}(\lambda)$ equivalent to (42). ∎

Proof of item 2.

We fix an arbitrary $\varkappa>0$ and set $\delta=V_{\rm o}^{\frac{-1}{4+2\varkappa}}$ , so that $\delta^{2+\varkappa}=V_{\rm o}^{-1/2}$ . We will apply the implicit function theorem to

[TABLE]

We have $g(0,0)=0$ , $\partial_{m}g(0,0)=-(1-\varepsilon)c_{1}$ , $\partial_{\delta}g(0,0)=0$ . Hence, there is a neighborhood of $(0,0)$ in which Eq. (46) has a unique root $m_{\rm p}=m_{\rm p}(\delta)$ for each fixed $\delta$ , and

[TABLE]

Furthermore, one can check that the second partial derivatives of $g(m,\delta)$ are continuous in a neighborhood of $(0,0)$ . Therefore, $m_{\rm p}=O(\delta^{2})$ as $\delta\to 0$ . Since $\varkappa>0$ is arbitrary, the latter asymptotics is equivalent to (44). ∎

C.2 An auxiliary algebraic relation

For the reader’s convenience, we formulate the following lemma, which is proved in [11, Lemma 3.1].

Lemma C.2.

For each $\alpha>0$ , the equation

[TABLE]

with respect to $A$ has a unique root $A(\alpha)$. The function $A(\alpha)$ is monotone increasing from $0$ to $1$ and satisfies $\alpha-A(\alpha)>0$ for all $\alpha>0$, see Fig. 4.

It implies the following corollary.

Corollary C.1.

For each $\alpha,\sigma>0$ , the equation

[TABLE]

with respect to $V$ has a unique root $V=V_{\rm p}=\frac{\sigma}{\alpha-A(\alpha)}$ , where $A(\alpha)$ is defined in Lemma C.2.

C.3 Proof of Theorem $4.1$

We set $\tilde{m}_{\rm p}(\varepsilon):=m_{\rm p}-m_{\rm g}(\varepsilon)$ and $\tilde{m}_{\rm o}(\varepsilon):=m_{\rm o}(\varepsilon)-m_{\rm g}(\varepsilon)$ .

Then $\tilde{m}_{\rm p}$ satisfies

[TABLE]

where $f(\cdot)$ is uniformly bounded with respect to all its arguments. Further, $V_{\rm g}=V_{\rm g}(\varepsilon)$ is bounded by assumption, and Eq. (49) with the zero right hand side has a unique solution $\tilde{m}_{\rm p}=0$ . Therefore, there exists $\tilde{m}_{\rm p}\to 0$ as $\varepsilon\to 0$ uniformly with respect to $\tilde{m}_{\rm o}\in{\mathbb{R}}$ , $V_{\rm o}>0$ , and $V_{\rm g}$ .

Due to (13), (18), the equilibrium $(\alpha,\sigma)$ satisfies

[TABLE]

Note that the functions $G$ and $H$ coincide with those in (29) and (30) (the latter up to a sign), but here we explicitly indicate their dependence on $V_{\rm g}$ , $\tilde{m}_{\rm o}$ , $V_{\rm o}$ , and $\tilde{m}_{\rm p}$ .

Using Corollary C.1 and the fact that $\tilde{m}_{\rm p}\to 0$ , we can pass to the limit in (51) as $\varepsilon\to 0$ , and we see that $V_{\rm g}(\varepsilon)\to V_{\rm p}$ . Hence, passing to the limit in (50), we have

[TABLE]

where

[TABLE]

Since $V_{\rm o}(\varepsilon)$ or $\tilde{m}_{\rm o}(\varepsilon)$ is bounded by assumption, we obtain $\tilde{m}_{\rm o}=e^{\frac{b_{0}+o(1)}{2\varepsilon}}$ or $V_{\rm o}=e^{\frac{b_{0}+o(1)}{\varepsilon}}$, respectively. Combining this with Lemma C.1 concludes the proof.

Appendix D Proof of Theorem $4.2$

In the formulation of Theorem 4.2, we use the constant

[TABLE]

In the proof, we will also need the constant

[TABLE]

Note that after substituting $V_{\rm p}$ given by (6), the variable $\sigma$ cancels. Thus $b$ and $b_{1}$ are indeed functions of $\alpha$ only, with $\lim\limits_{\alpha\to 0}b(\alpha)=2$ .

We will prove the theorem under the assumption that $V_{\rm o}$ does not depend on $\varepsilon$ . The case where $m_{\rm o}$ does not depend on $\varepsilon$ is analogous.

Without loss of generality, assume that $m=m_{\rm g}=0$ . Due to (13), (18), the equilibrium $(\alpha,\sigma)$ satisfies

[TABLE]

Note that the functions $G$ and $H$ are the same as in (50) and (51), but we omit the dependence on $m_{\rm p}$ , which is assumed to coincide with $m_{\rm g}$ .

We will show that one can find unique roots $V_{\rm g}=V_{\rm g}(\varepsilon)$ and $m_{\rm o}=m_{\rm o}(\varepsilon)$ of Eq. (55) and (56) as functions of $\varepsilon$ (and the other parameters) and determine their asymptotics, provided $\varepsilon$ is small. First, assume that $V_{\rm g}=V_{\rm g}(\varepsilon)$ and $m_{\rm o}=m_{\rm o}(\varepsilon)$ exist for all sufficiently small $\varepsilon$ . Then $V_{\rm g}(\varepsilon)$ is bounded as $\varepsilon\to 0$ . Otherwise, passing to the limit in (56), we would obtain $2\alpha=0$ . Furthermore, it is bounded away from zero. Otherwise, passing to the limit in (56), we would obtain $-1=0$ . Thus, in what follows, it suffices to consider $V_{\rm g}$ from a bounded interval separated from zero.

We introduce the variable $\mu$ instead of $m_{\rm o}$ such that $m_{\rm o}=m_{\rm o}(\varepsilon,\mu)=(2\sigma)^{1/2}e^{b_{1}/2}e^{b_{0}/(2\varepsilon)}(1+\mu)$ and prove existence of $\mu(\varepsilon),V_{\rm g}(\varepsilon)$ . Here $b_{0}$ is given by (52) and $b_{1}$ by (54).

First, we solve Eq. (56) for $V_{\rm g}=V_{\rm g}(\varepsilon,\mu)$ . Consider the function $\tilde{H}(\varepsilon,\mu,V_{\rm g}):=H(\alpha,\sigma,\varepsilon,V_{\rm g},m_{\rm o}(\varepsilon,\mu))$ . Note that there is $\varepsilon_{1}\in(0,1)$ independent of $\mu$ such that for all $\varepsilon\in[0,\varepsilon_{1}]$ and $\mu\in{\mathbb{R}}$ ,

[TABLE]

and $\tilde{H}(\varepsilon,\mu,V_{\rm g})$ is monotone with respect to $V_{\rm g}$ . Hence, Eq. (56) has a unique root $V_{\rm g}=V_{\rm g}(\varepsilon,\mu)$ for all $\varepsilon\in[0,\varepsilon_{1}]$ and $\mu$ , and, due to Corollary C.1, $V_{\rm g}(\varepsilon,\mu)=V_{\rm p}+o(1)$ , where $o(1)$ is uniform with respect to all $\mu\in{\mathbb{R}}$ . The partial derivatives of $\tilde{H}$ with respect to all its arguments are continuous for all $\varepsilon\in[0,\varepsilon_{1}]$ , and $\mu,V_{\rm g}$ . Furthermore, as $\varepsilon\to 0$ , we have

[TABLE]

Hence, by the implicit function theorem, $V_{\rm g}(\varepsilon,\mu)$ is continuously differentiable with respect to $\varepsilon,\mu$ for all $\varepsilon\in[0,\varepsilon_{1}]$ and $\mu\in{\mathbb{R}}$ . In particular,

[TABLE]

where $b$ is defined in Eq. (53).

We substitute $m_{\rm o}=m_{\rm o}(\varepsilon,\mu)$ and $V_{\rm g}=V_{\rm g}(\varepsilon,\mu)$ into (55), and obtain the equation

[TABLE]

where

[TABLE]

Note that

[TABLE]

where

[TABLE]

and $b$ is defined in (27). Therefore,

[TABLE]

where $b_{0}$ and $b_{1}$ are defined in (52) and (54) and $f_{3},\partial_{\mu}f_{3},\partial_{\varepsilon}f_{3}$ are bounded and continuous for $\varepsilon\in[0,\varepsilon_{1}]$ and all $\mu\in{\mathbb{R}}$ .

Further,

[TABLE]

where

[TABLE]

Combining (58)–(61), we see that, for $\varepsilon>0$ , Eq. (58) is equivalent to

[TABLE]

We have $\hat{G}(0,0)=0$ , the partial derivatives of $\hat{G}$ are continuous for all $\varepsilon\in[0,\varepsilon_{1}]$ and $\mu\in{\mathbb{R}}$ , and $\partial_{\mu}\hat{G}(0,0)>0$ . Hence, by the implicit function theorem, there exist small $\varepsilon_{*}>0$ and $\mu_{*}\in{\mathbb{R}}$ such that Eq. (62) has a unique solution $\mu=\mu(\varepsilon)$ for all $\varepsilon\in[0,\varepsilon_{*}]$ , $\mu\in[-\mu_{*},\mu_{*}]$ . This solution is continuously differentiable in a neighborhood of the origin. Similarly, there is a unique solution $\mu=\mu(\varepsilon)$ for all $\varepsilon\in[0,\varepsilon_{*}]$ , $\mu\in[-2-\mu_{*},-2+\mu_{*}]$ . To prove that there are no solutions outside of these two $\mu$ -regions, one can show that $f_{4}(\varepsilon,\mu)$ is monotone decreasing for $\mu\in(-\infty,-1+\mu_{1}(\varepsilon))$ , monotone increasing for $\mu\in(-1+\mu_{1}(\varepsilon),\infty)$ , and $\mu_{1}(\varepsilon)\to 0$ . This proves (29). Applying the chain rule to $V_{\rm g}(\varepsilon,\mu(\varepsilon))$ also yields (28).

Appendix E The RMSE( $n$ ) curves

Figure 5 shows the RMSE( $n$ ) curves for different methods and data sets from Sec. 5.3, fitted on training sets contaminated by 5% of outliers. We see that (possibly after removing a small number of samples for which EnsGCP predicts a high variance) the RMSE of EnsGCP is significantly better than the respective RMSE of the other methods.

Appendix F Architectures and hyperparameters

F.1 Architectures

We use one-hidden-layer networks with 50 ReLU nonlinearities for the Beta, Gamma, and GCP. Whenever a method uses several quantities (e.g., the mean and variance in the Beta and Gamma, or $m,\nu,\alpha,\beta$ in the GCP), we approximate each quantity by a separate network. For regularization in non-Bayesian methods, we use a dropout layer between the hidden layer and the output unit. Our approach is directly applicable to neural networks of any depth and structure; however, we keep one hidden layer for the compatibility of our validation with [14, 20, 12]. For BetaBayes, we used the architecture from the authors’ code (footnote 8: https://github.com/futoshi-futami/Robust_VI).

F.2 Hyperparameters

When we fit all the methods except the BetaBayes, we first contaminate the training set by outliers and then normalize it such that the input features and the targets have zero mean and unit variance. For the BetaBayes, significantly better results were achieved without normalizing the targets. (Footnote 9: Note that the authors in [7] used a different protocol, namely they normalized the noncontaminated training set and then added outliers to it.)
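The normalization step for the non-BetaBayes methods can be sketched as follows. The statistics are computed on the already contaminated training set, as described above; applying the same training statistics to the test inputs is an assumption of this sketch, and the function name `standardize` is hypothetical.

```python
import numpy as np

def standardize(X_train, y_train, X_test):
    """Scale features and targets to zero mean and unit variance using
    statistics of the (contaminated) training set."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    x_mean, x_std = X_train.mean(axis=0), X_train.std(axis=0)
    x_std = np.where(x_std > 0, x_std, 1.0)  # guard against constant features
    y_mean, y_std = y_train.mean(), y_train.std()
    X_train_n = (X_train - x_mean) / x_std
    y_train_n = (y_train - y_mean) / y_std
    # Assumption: test inputs reuse the training statistics.
    X_test_n = (X_test - x_mean) / x_std
    # Target statistics are returned so predictions can be mapped back.
    return X_train_n, y_train_n, X_test_n, (y_mean, y_std)
```

Because the outliers inflate the target standard deviation, the order of contamination and normalization matters, which is the difference from the protocol of [7] noted in the footnote.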

For the Beta, Gamma, and GCP, we used minibatch size 5 on Boston, Concrete, and Yacht, and minibatch size 10 on Power and Kin8nm. We used the Adam optimizer (with $\beta_{1}=0.9$, $\beta_{2}=0.999$) for fitting, and performed a grid search for the learning rate in the range $\{0.00002,0.00005,0.0001,0.0002,0.0007,0.001,0.005\}$ and for the dropout rate in the range $\{0,0.1,0.2,0.3,0.4\}$. For the Beta and Gamma, we optimized for the learning rate with the fixed parameters $\beta=1$ and $\gamma=1$, respectively. After that, we additionally performed a grid search for $\beta$ and $\gamma$ in the range $\{0.1,0.2,\dots,1\}$. We observed that changing the learning rate for the newly found values of $\beta$ and $\gamma$ did not significantly improve the results. All the grid searches were performed for training data sets with 5% of outliers and evaluated on the noncontaminated test sets. The optimized parameters are given in Table 2. For the ensemble methods, we used the hyperparameters that were optimal for the respective non-ensemble methods, but with half the dropout rate. For BetaBayes, we used the architecture, the default settings, and the optimizer based on the Edward library [32] as in the authors’ code (footnote 10: https://github.com/futoshi-futami/Robust_VI), and we performed a grid search for the parameter $\beta=0.1,0.2,\dots,1$ and the standard deviation of the likelihood $\sigma=0.1,0.5,1,2,4,6,8,10.$
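The first stage of the search, over learning rate and dropout rate, can be sketched as a plain exhaustive loop. The callback `evaluate` is hypothetical: it stands for fitting a model with the given settings on a training set with 5% of outliers and returning a score to be minimized on the noncontaminated test set. The scoring criterion itself is not specified in the text and is an assumption here.

```python
import itertools

# Grids as listed in the text.
LEARNING_RATES = [0.00002, 0.00005, 0.0001, 0.0002, 0.0007, 0.001, 0.005]
DROPOUT_RATES = [0.0, 0.1, 0.2, 0.3, 0.4]

def grid_search(evaluate):
    """Return (best_score, learning_rate, dropout_rate) minimizing
    evaluate(learning_rate, dropout_rate) over the full grid."""
    best = None
    for lr, dropout in itertools.product(LEARNING_RATES, DROPOUT_RATES):
        score = evaluate(lr, dropout)  # hypothetical fit-and-score callback
        if best is None or score < best[0]:
            best = (score, lr, dropout)
    return best
```

For the Beta and Gamma methods, a second pass over $\beta$ or $\gamma$ in $\{0.1,0.2,\dots,1\}$ with the learning rate held fixed would follow the same pattern.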

Bibliography

[1] Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
[2] Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 7–9, July 2015, Lille, France, JMLR, 2016. W&CP 37, 1613.
[4] Diakonikolas, I., Kamath, G., and Kane, D. M. Sever: A robust meta-algorithm for stochastic optimization. arXiv:1803.02815 [cs.LG], 2018.
[5] Ferrari, D. and Yang, Y. Maximum lq-likelihood estimation. Annals of Statistics, 38(2):753–783, 2010.
[6] Fujisawa, H. and Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
[7] Futami, F., Sato, I., and Sugiyama, M. Variational inference based on robust divergences. 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), pp. 4–9, 2017.
[8] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, New York, USA, JMLR, 2016. W&CP 48, 1050.
