On Stabilizing Generative Adversarial Training with Noise

Simon Jenni; Paolo Favaro

arXiv:1906.04612·cs.CV·September 18, 2019


TL;DR

This paper introduces a novel filtering-based method to stabilize GAN training by extending the support of data distributions, enabling more reliable convergence and improved performance across various datasets.

Contribution

The authors propose using filtered versions of real and generated data distributions to address support limitations, enhancing GAN stability and compatibility with existing models.

Findings

01 Training becomes more stable with the filtering approach.
02 The method improves performance on multiple datasets.
03 It can be integrated into various GAN architectures.


Tables

Table 1: Network architectures used for experiments on CIFAR-10 and STL-10. Images are assumed to be of size $32\times 32$ for CIFAR-10 and $64\times 64$ for STL-10. We set $M=512$ for CIFAR-10 and $M=1024$ for STL-10. Layers in parentheses are only included for STL-10. The noise-generator network follows the generator architecture with the number of channels reduced by a factor of 8. BN indicates the use of batch-normalization [9].

Generator CIFAR-10/(STL-10)
$z \in \mathbb{R}^{128} \sim \mathcal{N}(0, I)$
fully-conn.	BN	ReLU	$4 \times 4 \times M$
(deconv	$4 \times 4$	str.=2	BN	ReLU	512)
deconv	$4 \times 4$	str.=2	BN	ReLU	256
deconv	$4 \times 4$	str.=2	BN	ReLU	128
deconv	$4 \times 4$	str.=2	BN	ReLU	64
deconv	$3 \times 3$	str.=1	tanh	3

Discriminator CIFAR-10/(STL-10)
conv	$3 \times 3$	str.=1	lReLU	64
conv	$4 \times 4$	str.=2	BN	lReLU	64
conv	$4 \times 4$	str.=2	BN	lReLU	128
conv	$4 \times 4$	str.=2	BN	lReLU	256
conv	$4 \times 4$	str.=2	BN	lReLU	512
(conv	$4 \times 4$	str.=2	BN	lReLU	1024)
fully-connected	sigmoid	1
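The resolutions implied by the generator in Table 1 can be checked with a short sketch. It assumes the standard transposed-convolution size formula with padding 1 for both the $4\times 4$ stride-2 and the $3\times 3$ stride-1 layers; the padding values and the function names are assumptions, since the table does not state them.

```python
def deconv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_resolution(num_stride2_layers: int, start: int = 4) -> int:
    """Spatial size after the stride-2 deconvs and the final 3x3 stride-1 layer."""
    size = start  # the fully-connected layer is reshaped to 4 x 4 x M
    for _ in range(num_stride2_layers):
        # Assumed padding of 1: each 4x4 stride-2 layer doubles the resolution.
        size = deconv_out(size, kernel=4, stride=2, padding=1)
    # Assumed padding of 1: the 3x3 stride-1 output layer keeps the resolution.
    return deconv_out(size, kernel=3, stride=1, padding=1)


print(generator_resolution(3))  # CIFAR-10: three stride-2 layers -> 32
print(generator_resolution(4))  # STL-10: one extra stride-2 layer -> 64
```

Under these assumptions the three stride-2 layers map $4\times 4$ to $32\times 32$, and the extra STL-10 layer yields $64\times 64$, matching the image sizes in the caption.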

Table 2: We perform ablation experiments on CIFAR-10 and STL-10 to demonstrate the effectiveness of our proposed algorithm. Experiments (a)-(c) show results where only filtered examples are fed to the discriminator. Experiment (c) corresponds to previously proposed noise-annealing and results in an improvement over the standard GAN training. Our approach of feeding both filtered and clean samples to the discriminator shows a clear improvement over the baseline.

Experiment	CIFAR-10 FID	CIFAR-10 IS	STL-10 FID	STL-10 IS
Standard GAN	$46.1 \pm 0.7$	$6.12 \pm .09$	$78.4 \pm 6.7$	$8.22 \pm .37$
(a) Noise only: $ϵ \sim 𝒩 (0, I)$	$94.9 \pm 4.9$	$4.68 \pm .12$	$107.9 \pm 2.3$	$6.48 \pm .19$
(b) Noise only: $ϵ$ learned	$69.0 \pm 3.4$	$5.05 \pm .14$	$107.2 \pm 3.4$	$6.39 \pm .22$
(c) Noise only: $ϵ \sim 𝒩 (0, σ I)$ , $σ \to 0$	$44.5 \pm 3.2$	$6.85 \pm .20$	$75.9 \pm 1.9$	$8.49 \pm .19$
(d) Clean + noise: $ϵ \sim 𝒩 (0, I)$	$29.7 \pm 0.6$	$7.16 \pm .05$	$66.5 \pm 2.3$	$8.64 \pm .17$
(e) Clean + noise: $ϵ \sim 𝒩 (0, σ I)$ with learnt $σ$	$28.8 \pm 0.7$	$7.23 \pm .14$	$71.3 \pm 1.7$	$8.30 \pm .12$
(f) DFGAN $(λ = 0.1)$	$27.7 \pm 0.8$	$7.31 \pm .06$	$63.9 \pm 1.7$	$8.81 \pm .07$
(g) DFGAN $(λ = 1)$	$26.5 \pm 0.6$	$7.49 \pm .04$	$64.0 \pm 1.4$	$8.52 \pm .16$
(h) DFGAN $(λ = 10)$	$29.8 \pm 0.4$	$6.55 \pm .08$	$66.9 \pm 3.2$	$8.38 \pm .20$
(i) DFGAN alt. mini-batch $(λ = 1)$	$28.7 \pm 0.6$	$7.3 \pm .05$	$67.8 \pm 3.2$	$8.30 \pm .11$

Table 3: We apply our proposed GAN training to various previous GAN models trained on CIFAR-10 and CelebA. The same network architectures and hyperparameters as in the original works are used (for SVM-GAN we used the network in Table 1). We observe that our method increases performance in most cases even with the suggested hyperparameter settings. Note that our method also allows successful training with the original minimax MMGAN loss as opposed to the commonly used heuristic (e.g., in DCGAN).

Model	CIFAR-10 FID	CIFAR-10 IS	CelebA FID
MMGAN [6]	$> 450$	$\sim 1$	$> 350$
DCGAN [18]	$33.4 \pm 0.5$	$6.73 \pm .07$	$25.4 \pm 2.6$
WGAN-GP [7]	$37.7 \pm 0.4$	$6.55 \pm .08$	$15.5 \pm 0.2$
LSGAN [16]	$38.7 \pm 1.8$	$6.73 \pm .12$	$21.4 \pm 1.1$
SVM-GAN [14]	$43.9 \pm 1.0$	$6.25 \pm .09$	$26.5 \pm 1.9$
SNGAN [17]	$29.1 \pm 0.4$	$7.26 \pm .06$	$13.2 \pm 0.3$
MMGAN + DF ($λ = 0.1$)	$33.1 \pm 0.7$	$6.91 \pm .05$	$16.6 \pm 1.9$
DCGAN + DF ($λ = 10$)	$31.2 \pm 0.3$	$6.95 \pm .11$	$14.7 \pm 1.0$
LSGAN + DF ($λ = 10$)	$36.7 \pm 1.2$	$6.63 \pm .17$	$19.9 \pm 0.4$
SVM-GAN + DF ($λ = 1$)	$28.7 \pm 1.1$	$7.31 \pm .11$	$12.7 \pm 0.7$
SNGAN + DF ($λ = 1$)	$25.9 \pm 0.3$	$7.47 \pm .08$	$10.5 \pm 0.4$

Table 4: Hyperparameter settings used to evaluate the robustness of our proposed GAN training method. We vary the learning rate $\alpha$, the normalization in $G$, the optimizer, the activation functions, the number of discriminator iterations $n_{disc}$ and the number of training examples $n_{train}$.

Exp.	LR $\alpha$	BN in $G$	Opt.	ActFn	$n_{disc}$	$n_{train}$
a)	$2 \cdot 10^{- 4}$	FALSE	ADAM	(l)ReLU	1	50K
b)	$2 \cdot 10^{- 4}$	TRUE	ADAM	tanh	1	50K
c)	$1 \cdot 10^{- 3}$	TRUE	ADAM	(l)ReLU	1	50K
d)	$1 \cdot 10^{- 2}$	TRUE	SGD	(l)ReLU	1	50K
e)	$2 \cdot 10^{- 4}$	TRUE	ADAM	(l)ReLU	5	50K
f)	$2 \cdot 10^{- 4}$	TRUE	ADAM	(l)ReLU	1	5K

Equations

$$\min_{G}\max_{D}\;\mathbb{E}_{x}[\log D(x)]+\mathbb{E}_{z}[\log(1-D(G(z)))] \qquad (1)$$

$$\min_{G}\max_{D}\sum_{p_{\epsilon}\in\mathcal{S}}\Big(\mathbb{E}_{\epsilon\sim p_{\epsilon}}\big[\mathbb{E}_{x\sim p_{d}}[\log D(x+\epsilon)]\big]+\mathbb{E}_{\epsilon\sim p_{\epsilon}}\big[\mathbb{E}_{z\sim\mathcal{N}(0,I_{d})}[\log(1-D(G(z)+\epsilon))]\big]\Big) \qquad (2)$$

$$D(x)=\frac{\sum_{p_{\epsilon}\in\mathcal{S}}p_{d,\epsilon}(x)}{\sum_{p_{\epsilon}\in\mathcal{S}}\big(p_{d,\epsilon}(x)+p_{g,\epsilon}(x)\big)} \qquad (3)$$

$$\min_{G}\;\mathrm{JSD}\Big(\frac{1}{|\mathcal{S}|}\sum_{p_{\epsilon}\in\mathcal{S}}p_{d,\epsilon},\;\frac{1}{|\mathcal{S}|}\sum_{p_{\epsilon}\in\mathcal{S}}p_{g,\epsilon}\Big) \qquad (4)$$

$$\min_{p_{g}}\;\mathrm{JSD}\Big(\tfrac{1}{2}(p_{d}+p_{d}\ast p_{\epsilon}),\;\tfrac{1}{2}(p_{g}+p_{g}\ast p_{\epsilon})\Big) \qquad (5)$$

$$p_{d}+p_{d}\ast p_{\epsilon}=p_{g}+p_{g}\ast p_{\epsilon} \qquad (6)$$

$$\hat{\Delta}(\omega)\big(1+\hat{p}_{\epsilon}(\omega)\big)=0,\quad\forall\omega\in\hat{\Omega} \qquad (7)$$

$$\int p_{\epsilon}(x)\,e^{-jx^{\top}\omega^{\ast}}\,dx=-1 \qquad (8)$$

$$\int p_{\epsilon}(x)\cos(x^{\top}\omega^{\ast})\,dx=-1 \qquad (9)$$

$$\int p_{\epsilon}(x)\sin(x^{\top}\omega^{\ast})\,dx=0 \qquad (10)$$

$$\int p_{\epsilon}(x)\cos(x^{\top}\omega^{\ast})\,dx>-\int p_{\epsilon}(x)\,dx=-1 \qquad (11)$$

$$\min_{G}\min_{\sigma}\max_{D}\;\lambda\Gamma+\mathbb{E}_{x}\Big[\log D(x)+\mathbb{E}_{\epsilon}\log D(x+\epsilon)\Big]+\mathbb{E}_{z}\Big[\log[1-D(G(z))]+\mathbb{E}_{\epsilon}\log[1-D(G(z)+\epsilon)]\Big] \qquad (12)$$


Abstract

We present a novel method and analysis to train generative adversarial networks (GAN) in a stable manner. As shown in recent analysis, training is often undermined by the probability distribution of the data being zero on neighborhoods of the data space. We notice that the distributions of real and generated data should match even when they undergo the same filtering. Therefore, to address the limited support problem we propose to train GANs by using different filtered versions of the real and generated data distributions. In this way, filtering does not prevent the exact matching of the data distribution, while helping training by extending the support of both distributions. As filtering we consider adding samples from an arbitrary distribution to the data, which corresponds to a convolution of the data distribution with the arbitrary one. We also propose to learn the generation of these samples so as to challenge the discriminator in the adversarial training. We show that our approach results in a stable and well-behaved training of even the original minimax GAN formulation. Moreover, our technique can be incorporated in most modern GAN formulations and leads to a consistent improvement on several common datasets.

1 Introduction

Since the seminal work of [6], generative adversarial networks (GAN) have been widely used and analyzed due to the quality of the samples that they produce, in particular when applied to the space of natural images. Unfortunately, GANs still prove difficult to train. In fact, a vanilla implementation does not converge to a high-quality sample generator, and heuristics used to improve the generator often exhibit unstable behavior. This has led to substantial work to better understand GANs (see, for instance, [23, 19, 1]). In particular, [1] points out how the unstable training of GANs is due to the (limited and low-dimensional) support of the data and model distributions.

In the original GAN formulation, the generator is trained against a discriminator in a minimax optimization problem. The discriminator learns to distinguish real from fake samples, while the generator learns to generate fake samples that can fool the discriminator. When the support of the data and model distributions is disjoint, the generator stops improving as soon as the discriminator achieves perfect classification, because this prevents the propagation of useful information to the generator through gradient descent (see Fig. 1(a)).
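The stalled-generator effect described above can be illustrated with a minimal sketch (an illustration, not the authors' code). It assumes a discriminator of the form $D=\mathrm{sigmoid}(a)$ for a logit $a$, and compares the derivative of the minimax generator loss $\log(1-D)$ with that of the commonly used heuristic loss $-\log D$; all function names are hypothetical.

```python
import math


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def minimax_grad(a: float) -> float:
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)


def heuristic_grad(a: float) -> float:
    """d/da [-log sigmoid(a)] = sigmoid(a) - 1."""
    return sigmoid(a) - 1.0


# A very negative logit means the discriminator confidently rejects the fake sample.
for logit in (0.0, -5.0, -20.0):
    print(logit, minimax_grad(logit), heuristic_grad(logit))
```

As the discriminator becomes confident on fake samples, the minimax gradient with respect to the logit shrinks towards zero, while the heuristic loss keeps a gradient close to $-1$.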

The recent work by [1] proposes to extend the support of the distributions by adding noise to both generated and real images before they are fed as input to the discriminator. This procedure results in a smoothing of both data and model probability distributions, which indeed increases their support extent (see Fig. 1(b)). For simplicity, let us assume that the probability density function of the data is well defined and let us denote it with $p_{d}$ . Then, samples $\tilde{x}=x+\epsilon$ , obtained by adding noise $\epsilon\sim p_{\epsilon}$ to the data samples $x\sim p_{d}$ , are also instances of the probability density function $p_{d,\epsilon}=p_{\epsilon}\ast p_{d}$ , where $\ast$ denotes the convolution operator. The support of $p_{d,\epsilon}$ is the Minkowski sum of the supports of $p_{\epsilon}$ and $p_{d}$ and thus larger than the support of $p_{d}$ . Similarly, adding noise to the samples from the generator probability density $p_{g}$ leads to the smoothed probability density $p_{g,\epsilon}=p_{\epsilon}\ast p_{g}$ . Adding noise is a quite well-known technique that has been used in maximum likelihood methods, but is considered undesirable as it yields approximate generative models that produce low-quality blurry samples. Indeed, most formulations with additive noise boil down to finding the model distribution $p_{g}$ that best solves $p_{d,\epsilon}=p_{g,\epsilon}$ . However, this usually results in a low quality estimate $p_{g}$ because $p_{d}\ast p_{\epsilon}$ has lost the high frequency content of $p_{d}$ . An immediate solution is to use a form of noise annealing, where the noise variance is initially high and is then reduced gradually during the iterations so that the original distributions, rather than the smooth ones, are eventually matched. This results in an improved training, but as the noise variance approaches zero, the optimization problem converges to the original formulation and the algorithm may be subject to the usual unstable behavior.
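The claim that adding noise convolves the densities and enlarges their support can be checked on a toy discrete example. The sketch below uses hypothetical 1-D probability mass functions (not data from the paper) and verifies that the support of the convolution is the Minkowski sum of the two supports.

```python
def convolve(p, q):
    """Discrete convolution: distribution of the sum of two independent variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def support(p, tol=1e-12):
    """Indices where the distribution puts non-negligible mass."""
    return [i for i, v in enumerate(p) if v > tol]


# Hypothetical toy distributions on an integer grid.
p_d = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]  # data: support {2, 3}
p_eps = [0.25, 0.5, 0.25]                        # noise: support {0, 1, 2}

p_d_eps = convolve(p_d, p_eps)
minkowski = sorted({i + j for i in support(p_d) for j in support(p_eps)})

print(support(p_d_eps))  # [2, 3, 4, 5]
print(minkowski)         # [2, 3, 4, 5]
```

The filtered distribution still sums to one, but its mass is spread over a strictly larger set than the original support.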

In this work, we design a novel adversarial training procedure that is stable and yields accurate results. We show that under some general assumptions it is possible to modify both the data and generated probability densities with additional noise without affecting the optimality conditions of the original noise-free formulation. As an alternative to the original formulation, with $z\sim{\cal N}(0,I_{d})$ and $x\sim p_{d}$ ,

$$\min_{G}\max_{D}\;\mathbb{E}_{x}[\log D(x)]+\mathbb{E}_{z}[\log(1-D(G(z)))], \qquad (1)$$

where $D$ denotes the discriminator, we propose to train a generative model $G$ by solving instead the following optimization

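$$\min_{G}\max_{D}\sum_{p_{\epsilon}\in\mathcal{S}}\Big(\mathbb{E}_{\epsilon\sim p_{\epsilon}}\big[\mathbb{E}_{x\sim p_{d}}[\log D(x+\epsilon)]\big]+\mathbb{E}_{\epsilon\sim p_{\epsilon}}\big[\mathbb{E}_{z\sim\mathcal{N}(0,I_{d})}[\log(1-D(G(z)+\epsilon))]\big]\Big), \qquad (2)$$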

where we introduced a set $\cal S$ of probability density functions. If we solve the innermost optimization problem in Problem (2), then we obtain the optimal discriminator

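$$D(x)=\frac{\sum_{p_{\epsilon}\in\mathcal{S}}p_{d,\epsilon}(x)}{\sum_{p_{\epsilon}\in\mathcal{S}}\big(p_{d,\epsilon}(x)+p_{g,\epsilon}(x)\big)}, \qquad (3)$$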

where we have defined $p_{g}$ as the probability density of $G(z)$ , where $z\sim{\cal N}(0,I_{d})$ . If we substitute this in the problem above and simplify we have

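$$\min_{G}\;\mathrm{JSD}\Big(\frac{1}{|\mathcal{S}|}\sum_{p_{\epsilon}\in\mathcal{S}}p_{d,\epsilon},\;\frac{1}{|\mathcal{S}|}\sum_{p_{\epsilon}\in\mathcal{S}}p_{g,\epsilon}\Big), \qquad (4)$$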

where JSD is the Jensen-Shannon divergence. We show that, under suitable assumptions, the optimal solution of Problem (4) is unique and $p_{g}=p_{d}$ . Moreover, since $\frac{1}{|\mathcal{S}|}\sum_{p_{\epsilon}\in\mathcal{S}}p_{d,\epsilon}$ enjoys a larger support than $p_{d}$ , the optimization via iterative methods based on gradient descent is more likely to achieve the global minimum, regardless of the support of $p_{d}$ . Thus, our formulation enjoys the following properties:

It defines a fitting of probability densities that is not affected by their support;
It guarantees the exact matching of the data probability density function;
It can be easily applied to other GAN formulations.

A simplified scheme of the proposed approach is shown in Fig. 2.
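The benefit of matching filtered distributions can be sketched numerically on a toy 1-D grid. In the sketch below (an illustration under stated assumptions, not the authors' experiment), $p_d$ and $p_g$ are point masses, the noise is uniform on offsets $-3,\ldots,3$, and the filtered mixture is $\frac{1}{2}(p+p\ast p_{\epsilon})$; the grid size, noise kernel and function names are all hypothetical choices.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log(pi / m)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / m)
    return total


def delta(pos, n=32):
    """Point mass at grid index `pos`."""
    return [1.0 if i == pos else 0.0 for i in range(n)]


def blur(p, half_width=3):
    """Convolve p with uniform noise on offsets -half_width..half_width."""
    n = len(p)
    w = 1.0 / (2 * half_width + 1)
    out = [0.0] * n
    for i, pi in enumerate(p):
        for k in range(-half_width, half_width + 1):
            if 0 <= i + k < n:
                out[i + k] += pi * w
    return out


def filtered_mixture(p, half_width=3):
    """The mixture 0.5 * (p + p * p_eps) of the clean and the filtered distribution."""
    return [0.5 * (a + b) for a, b in zip(p, blur(p, half_width))]


p_d = delta(10)
for pos in (12, 14):
    p_g = delta(pos)
    print(pos, jsd(p_d, p_g), jsd(filtered_mixture(p_d), filtered_mixture(p_g)))
```

With disjoint supports the plain JSD is stuck at $\log 2$ regardless of how far apart the point masses are, whereas the divergence between the filtered mixtures decreases as $p_g$ moves towards $p_d$ and vanishes only when the two coincide.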

In the next sections we introduce our analysis in detail and then devise a computationally feasible approximation of the problem formulation (2). Our method is evaluated quantitatively on CIFAR-10 [12], STL-10 [5], and CelebA [15], and qualitatively on ImageNet [20] and LSUN bedrooms [24].

2 Related Work

The inherent instability of GAN training was first addressed through a set of techniques and heuristics [22] and careful architectural design choices and hyper-parameter tuning [18]. [22] proposes the use of one-sided label smoothing and the injection of Gaussian noise into the layers of the discriminator. A theoretical analysis of the unstable training and the vanishing gradients phenomena was introduced by Arjovsky et al. [1]. They argue that the main source of instability stems from the fact that the real and the generated distributions have disjoint supports or lie on low-dimensional manifolds. In the case of an optimal discriminator this will result in zero gradients that then stop the training of the generator. More importantly, they also provide a way to avoid such difficulties by introducing noise and considering “softer” metrics such as the Wasserstein distance. [23] makes similar observations and also proposed the use of “instance noise” which is gradually reduced during training as a way to overcome these issues. Another recent work stabilizes GAN training in a similar way by transforming examples before feeding them to the discriminator [21]. The amount of transformation is then gradually reduced during training. They only transform the real examples, in contrast to [23], [1] and our work. [2] builds on the work of [1] and introduces the Wasserstein GAN (WGAN). The WGAN optimizes an integral probability metric that is the dual to the Wasserstein distance. This formulation requires the discriminator to be Lipschitz-continuous, which is realized through weight-clipping. [7] presents a better way to enforce the Lipschitz constraint via a gradient penalty over interpolations between real and generated data (WGAN-GP). [19] introduces a stabilizing regularizer based on a gradient norm penalty similar to that by [7]. 
Its formulation, however, is in terms of f-divergences and is derived via an analytic approximation of adversarial training with additive Gaussian noise on the datapoints. Another recent GAN regularization technique that bounds the Lipschitz constant of the discriminator is the spectral normalization introduced by [17]. This method demonstrates state-of-the-art robustness in adversarial training. Several alternative loss functions and GAN models have been proposed over the years, claiming superior stability and sample quality over the original GAN (e.g., [16], [25], [3], [2], [11]). Adversarial noise generation has previously been used in the context of classification to improve the robustness against adversarial perturbations [13].

3 Matching Filtered Distributions

We are interested in finding a formulation that yields as optimal generator $G$ a sampler of the data probability density function (pdf) $p_{d}$ , which we assume is well defined. The main difficulty in dealing with $p_{d}$ is that it may be zero on some neighborhood in the data space. An iterative optimization of Problem (1) based on gradient descent may yield a degenerate solution, i.e., such that the model pdf $p_{g}$ only partially overlaps with $p_{d}$ (a scenario called mode collapse). It has been noticed that adding samples of an arbitrary distribution to both real and fake data samples during training helps reduce this issue. In fact, adding samples $\epsilon\sim p_{\epsilon}$ corresponds to blurring the original pdfs $p_{d}$ and $p_{g}$ , an operation that is known to increase their support and thus their likelihood to overlap. This increased overlap means that iterative methods can exploit useful gradient directions at more locations and are then more likely to converge to the global solution. By building on this observation, we propose to solve instead Problem (2) and look for a way to increase the support of the data pdf $p_{d}$ without losing the optimality conditions of the original formulation of Problem (1).

Our result below proves that this is the case for some choices of the additive noise. We consider images of $m\times n$ pixels with values in a compact domain $\Omega\subset\mathbb{R}^{m\times n}$ , since image intensities are bounded from above and below. Then the support of the pdf $p_{d}$ is also bounded and contained in $\Omega$ . This implies that $p_{d}$ is also in $L^{2}(\Omega)$ .

Theorem 1.

Let us choose ${\cal S}$ such that Problem (4) can be written as

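$$\min_{p_{g}}\;\mathrm{JSD}\Big(\tfrac{1}{2}(p_{d}+p_{d}\ast p_{\epsilon}),\;\tfrac{1}{2}(p_{g}+p_{g}\ast p_{\epsilon})\Big), \qquad (5)$$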

where $p_{\epsilon}$ is a non-degenerate probability density function in $L^{2}(\Omega)$ . Let us also assume that the domain of $p_{g}$ is restricted to $\Omega$ (and thus $p_{g}\in L^{2}(\Omega)$ ). Then, the global optimum of Problem (5) is $p_{g}(x)=p_{d}(x)$ , $\forall x\in\Omega$ .

Proof.

The global minimum of the Jensen-Shannon divergence is achieved if and only if

$$p_{d}+p_{d}\ast p_{\epsilon}=p_{g}+p_{g}\ast p_{\epsilon}. \qquad (6)$$

Let $p_{g}=p_{d}+\Delta$ . Then, we have $\int\Delta(x)dx=0$ and $\int|\Delta(x)|^{2}dx<\infty$ . By substituting $p_{g}$ in eq. (6) we obtain $\Delta\ast p_{\epsilon}=-\Delta$ . Since $\Delta$ and $p_{\epsilon}$ are in $L^{2}(\Omega)$ , we can take the Fourier transform of both sides, and obtain

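$$\hat{\Delta}(\omega)\big(1+\hat{p}_{\epsilon}(\omega)\big)=0,\quad\forall\omega\in\hat{\Omega}. \qquad (7)$$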

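If $\Delta(x)\neq 0$ for some $x$ , then there exists $\omega^{\ast}$ such that $\hat{\Delta}(\omega^{\ast})\neq 0$ , and thus $1+\hat{p}_{\epsilon}(\omega^{\ast})=0$ . This means that

$$\int p_{\epsilon}(x)\,e^{-jx^{\top}\omega^{\ast}}\,dx=-1 \qquad (8)$$

or, equivalently,

$$\int p_{\epsilon}(x)\cos(x^{\top}\omega^{\ast})\,dx=-1, \qquad (9)$$

$$\int p_{\epsilon}(x)\sin(x^{\top}\omega^{\ast})\,dx=0. \qquad (10)$$

Notice that

$$\int p_{\epsilon}(x)\cos(x^{\top}\omega^{\ast})\,dx>-\int p_{\epsilon}(x)\,dx=-1 \qquad (11)$$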

unless $p_{\epsilon}(x)=0$ for any $x$ such that $x^{\top}\omega^{\ast}\neq\pi+2k\pi$ , with $k\in\mathbb{Z}$ . Since $p_{\epsilon}$ is not degenerate, then eq. (11) holds, and eq. (8) cannot be true, which leads to $\Delta(x)=0$ for all $x\in\Omega$ , and we can conclude that $p_{g}(x)=p_{d}(x)$ , $\forall x\in\Omega$ . ∎
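As a worked instance of the Fourier step in the proof, consider isotropic Gaussian noise with covariance $\sigma^{2}I$. This is an illustrative assumption: a Gaussian is not supported on the compact $\Omega$, so the example only illustrates why the factor $1+\hat{p}_{\epsilon}(\omega)$ cannot vanish. Its Fourier transform is real and strictly positive:

$$\hat{p}_{\epsilon}(\omega)=\int\mathcal{N}(x;0,\sigma^{2}I)\,e^{-jx^{\top}\omega}\,dx=e^{-\sigma^{2}\|\omega\|^{2}/2}\in(0,1]\quad\Longrightarrow\quad 1+\hat{p}_{\epsilon}(\omega)>1\quad\forall\omega.$$

Hence $\hat{\Delta}(\omega)\big(1+\hat{p}_{\epsilon}(\omega)\big)=0$ forces $\hat{\Delta}(\omega)=0$ for every $\omega$, that is, $\Delta=0$ and $p_{g}=p_{d}$.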

3.1 Formulation

Based on the above theorem we consider two cases:

1. Gaussian noise with a fixed/learned standard deviation $\sigma$ : $p_{\epsilon}(\epsilon)={\cal N}(\epsilon;0,\sigma I_{d})$ ;

2. Learned noise from a noise generator network $N$ with parameters $\sigma$ : $p_{\epsilon}(\epsilon)\text{ such that }\epsilon=N(w,\sigma),\text{ with }w\sim{\cal N}(0,I_{d}).$

In both configurations we can learn the parameter(s) $\sigma$ . We do so by minimizing the cost function after the maximization with respect to the discriminator. The minimization encourages large noise since this would make $p_{d,\epsilon}(\omega)$ more similar to $p_{g,\epsilon}(\omega)$ regardless of $p_{d}$ and $p_{g}$ . This would not be very useful to gradient descent. Therefore, to limit the noise magnitude we introduce as a regularization term the noise variance $\Gamma(\sigma)=\sigma^{2}$ or the Euclidean norm of the noise output image $\Gamma(\sigma)=\mathbb{E}_{w\sim{\cal N}(0,I_{d})}|N(w,\sigma)|^{2}$ , and multiply it by a positive scalar $\lambda$ , which we tune.

The proposed formulations can then be written in a unified way as:

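$$\min_{G}\min_{\sigma}\max_{D}\;\lambda\Gamma+\mathbb{E}_{x}\Big[\log D(x)+\mathbb{E}_{\epsilon}\log D(x+\epsilon)\Big]+\mathbb{E}_{z}\Big[\log[1-D(G(z))]+\mathbb{E}_{\epsilon}\log[1-D(G(z)+\epsilon)]\Big]. \qquad (12)$$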
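A mini-batch estimate of the unified objective, i.e. the regularizer $\lambda\Gamma$ plus the log-likelihood terms of the discriminator on clean and noisy real and generated samples, can be sketched as follows. The function takes discriminator outputs in $(0,1)$ as plain lists; the function names, the clamping constant and the input layout are assumptions made for illustration, not the authors' implementation.

```python
import math


def _mean_log(values, eps=1e-12):
    """Mean of log(v) over a batch, clamped away from log(0)."""
    return sum(math.log(max(v, eps)) for v in values) / len(values)


def df_objective(d_real, d_real_noisy, d_fake, d_fake_noisy, gamma, lam):
    """Batch estimate of the objective: D maximizes it, G and the noise parameters minimize it.

    d_real, d_real_noisy: D(x) and D(x + eps) on real samples.
    d_fake, d_fake_noisy: D(G(z)) and D(G(z) + eps) on generated samples.
    gamma: value of the noise regularizer Gamma; lam: its weight lambda.
    """
    real_term = _mean_log(d_real) + _mean_log(d_real_noisy)
    fake_term = (_mean_log([1.0 - d for d in d_fake])
                 + _mean_log([1.0 - d for d in d_fake_noisy]))
    return lam * gamma + real_term + fake_term


# An undecided discriminator (all outputs 0.5) gives 4 * log(0.5) plus the penalty.
print(df_objective([0.5], [0.5], [0.5], [0.5], gamma=0.0, lam=1.0))
```

In a training loop the discriminator would ascend this value, while the generator and the noise parameters would descend it; only the noise parameters are affected by the $\lambda\Gamma$ term.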

3.2 Implementation

Implementing our algorithm only requires a few minor modifications of the standard GAN framework. We perform the update for the noise-generator and the discriminator in the same iteration. Mini-batches for the discriminator are formed by collecting all the fake and real samples in two separate batches, i.e., $\{x_{1},\ldots,x_{m},x_{1}+\epsilon_{1},\ldots,x_{m}+\epsilon_{m}\}$ is the batch with real examples and $\{\tilde{x}_{1},\ldots,\tilde{x}_{m},\tilde{x}_{1}+\epsilon_{1},\ldots,\tilde{x}_{m}+\epsilon_{m}\}$ the fake examples batch. The complete procedure is outlined in Algorithm 1. The noise-generator architecture is typically the same as the generator, but with a reduced number of convolutional filters. Since the inputs to the discriminator are doubled when compared to the standard GAN framework, the DFGAN framework can be $1.5$ to $2$ times slower. Similar and more severe performance drops are present in existing variants (e.g., WGAN-GP). Note that by constructing the batches as $\{x_{1},\ldots,x_{m/2},x_{m/2+1}+\epsilon_{1},\ldots,x_{m}+\epsilon_{m}\}$ the training time is instead comparable to the standard framework, but it is much more stable and yields an accurate generator. For a comparison of the runtimes, see Fig. 4.
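The two mini-batch constructions described above can be sketched as follows. Samples are represented as flat lists of floats, the same noise samples are reused for the real and fake batches as in the text, and in the faster variant the first half of each batch stays clean while the second half receives noise. The function names and the pairing of noise samples with the second half are assumptions for illustration.

```python
def add_noise(x, eps):
    """Element-wise sum of a sample and a noise sample."""
    return [a + b for a, b in zip(x, eps)]


def full_batches(real, fake, noise):
    """Clean and noisy copies of every sample: batches of size 2m."""
    real_batch = list(real) + [add_noise(x, e) for x, e in zip(real, noise)]
    fake_batch = list(fake) + [add_noise(x, e) for x, e in zip(fake, noise)]
    return real_batch, fake_batch


def alt_batches(real, fake, noise):
    """Faster variant: first half clean, second half noisy, batches of size m."""
    half = len(real) // 2

    def mix(samples):
        clean = list(samples[:half])
        noisy = [add_noise(x, e) for x, e in zip(samples[half:], noise)]
        return clean + noisy

    return mix(real), mix(fake)


# Hypothetical 1-pixel "images" to show the layout of the batches.
real = [[1.0], [2.0]]
fake = [[3.0], [4.0]]
noise = [[0.5], [0.25]]
print(full_batches(real, fake, noise))
print(alt_batches(real, fake, noise))
```

The full construction doubles the number of discriminator inputs, which is the source of the slowdown mentioned above, while the alternative keeps the batch size unchanged.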

3.3 Batch-Normalization and Mode Collapse

The current best practice is to apply batch normalization to the discriminator separately on the real and fake mini-batches [4]. Indeed, this showed much better results when compared to feeding mini-batches with a 50/50 mix of real and fake examples in our experiments. The reason for this is that batch normalization implicitly takes into account the distribution of examples in each mini-batch. To see this, consider the example in Fig. 3. In the case of no separate normalization of fake and real batches we can observe mode-collapse. The modes covered by the generator are indistinguishable for the discriminator, which observes each example independently. There is no signal to the generator that leads to better mode coverage in this case. Since the first two moments of the fake and real batch distribution are clearly not matching, a separate normalization will help the discriminator distinguish between real and fake examples and therefore encourage better mode coverage by the generator.

Using batch normalization in this way turns out to be crucial for our method as well. Indeed, when no batch normalization is used in the discriminator, the generator will often tend to produce noisy examples. This is difficult to detect by the discriminator, since it judges each example independently. To mitigate this issue we apply separate normalization of the noisy real and fake examples before feeding them to the discriminator. We use this technique for models without batch normalization (e.g. SNGAN).
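The mechanics of separate normalization can be sketched on toy 1-D activations: each batch is standardized with its own mean and variance, so a mismatch in the first two moments between the real and fake batches is not averaged away in a shared statistic. The values below are hypothetical and the helper names are assumptions; this is not the batch-normalization layer used in the experiments.

```python
import math


def batch_stats(batch):
    """Mean and (biased) variance of a batch of scalars."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return mean, var


def normalize(batch, eps=1e-5):
    """Standardize a batch with its own statistics, as batch normalization does."""
    mean, var = batch_stats(batch)
    return [(v - mean) / math.sqrt(var + eps) for v in batch]


# Hypothetical activations: the real batch covers two modes, the fake batch only one side.
real = [-2.0, -1.0, 1.0, 2.0]
fake = [1.0, 1.0, 2.0, 2.0]

print(batch_stats(real), batch_stats(fake))  # differing moments of the two batches
real_n = normalize(real)            # normalized with real-batch statistics only
fake_n = normalize(fake)            # normalized with fake-batch statistics only
mixed_n = normalize(real + fake)    # a 50/50 mix shares one set of statistics
```

In the separate case each batch is rescaled by its own moments; in the mixed case both kinds of samples are shifted and scaled by the same pooled statistics.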

4 Experiments

We compare and evaluate our model using two common GAN metrics: the Inception score IS [22] and the Fréchet Inception distance FID [8]. Throughout this section we use 10K generated and real samples to compute IS and FID. In order to get a measure of the stability of the training we report the mean and standard deviation of the last five checkpoints for both metrics (obtained in the last 10% of training). More reconstructions, experiments and details are provided in the supplementary material.
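The reporting protocol (mean and standard deviation over the last five checkpoints) amounts to the following small computation. The text does not say whether the sample or the population standard deviation is used; the sketch assumes the sample version, and the FID values are placeholders, not results from the paper.

```python
import statistics


def summarize(values):
    """Mean and sample standard deviation of a metric over checkpoints."""
    return statistics.mean(values), statistics.stdev(values)


# Placeholder FID values for five checkpoints (hypothetical, for illustration only).
fid_last_five = [27.0, 26.1, 26.9, 26.4, 26.6]
mean_fid, std_fid = summarize(fid_last_five)
print(f"FID: {mean_fid:.1f} +/- {std_fid:.1f}")
```

A small standard deviation across the final checkpoints indicates that the metric has stopped fluctuating, which is how the tables use it as a proxy for training stability.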

4.1 Ablations

To verify our model we perform ablation experiments on two common image datasets: CIFAR-10 [12] and STL-10 [5]. For CIFAR-10 we train on the 50K $32\times 32$ RGB training images and for STL-10 we resize the 100K $96\times 96$ training images to $64\times 64$ . The network architectures resemble the DCGAN architectures of [18] and are detailed in Table 1. All the models are trained for 100K generator iterations using a mini-batch size of 64. We use the ADAM optimizer [10] with a learning rate of $10^{-4}$ and $\beta_{1}=0.5$ . Results on the following ablations are reported in Table 2:

(a)-(c) Only noisy samples:

In this set of experiments we only feed noisy examples to the discriminator. In experiment (a) we add Gaussian noise and in (b) we add learned noise. In both cases the noise level is not annealed. While this leads to stable training, the resulting samples are of poor quality which is reflected by high FID and low IS. The generator will tend to also produce noisy samples since there is no incentive to remove the noise. Annealing the added noise during training as proposed by [1] and [23] leads to an improvement over the standard GAN. This is demonstrated in experiment (c). The added Gaussian noise is linearly annealed during the 100K iterations in this case;

(d)-(i) Both noisy and clean samples:

The second set of experiments consists of variants of our proposed model. Experiments (d) and (e) use a simple Gaussian noise model; in (e) the standard deviation of the noise $\sigma$ is learned. We observe a drastic improvement in the quality of the generated examples even with this simple modification. The other experiments show results of our full model with a separate noise-generator network. We vary the weight $\lambda$ of the $L^{2}$ norm of the noise in experiments (f)-(h). Ablation (i) uses the alternative mini-batch construction with faster runtime as described in Section 3.2;
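The linear noise annealing used in experiment (c) can be sketched as a simple schedule. The initial standard deviation of 1.0 is an assumption, since the starting value is not given in the text; the 100K-step horizon follows the training length stated above, and the function name is hypothetical.

```python
def annealed_sigma(step, total_steps=100_000, sigma_start=1.0):
    """Noise standard deviation decayed linearly from sigma_start to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return sigma_start * (1.0 - frac)


for step in (0, 50_000, 100_000):
    print(step, annealed_sigma(step))
```

At the end of the schedule the noise vanishes, so the objective reduces to the standard noise-free formulation, which is why annealing alone can still inherit its instabilities.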

Application to Different GAN Models. We investigate the possibility of applying our proposed training method to several standard GAN models. The network architectures are the same as proposed in the original works with only the necessary adjustments to the given image-resolutions of the datasets (i.e., truncation of the network architectures). The only exception is SVM-GAN, where we use the architecture in Table 1. Note that for the GAN with minimax loss (MMGAN) and WGAN-GP we use the architecture of DCGAN. Hyper-parameters are kept at their default values for each model. The models are evaluated on two common GAN benchmarks: CIFAR-10 [12] and CelebA [15]. The image resolution is $32\times 32$ for CIFAR-10 and $64\times 64$ for CelebA. All models are trained for 100K generator iterations. For the alternative objective function of LSGAN and SVM-GAN we set the loss of the noise generator to be the negative of the discriminator loss, as is the case in our standard model. The results are shown in Table 3. We can observe that applying our training method improves performance in most cases and even enables the training with the original saturation-prone minimax GAN objective, which is very unstable otherwise. Note also that applying our method to SNGAN [17] (the current state-of-the-art) leads to an improvement on both datasets. We also evaluated SNGAN with and without our method on $64\times 64$ images of STL-10 (same as in Table 2) where our method boosts the performance from an FID of $66.3\pm 1.1$ to $58.3\pm 1.4$ . We show random CelebA reconstructions from models trained with and without our approach in Fig. 5.

Robustness to Hyperparameters. We test the robustness of DFGANs with respect to various hyperparameters by training on CIFAR-10 with the settings listed in Table 4. The network is the same as specified in Table 1. The noise penalty term is set to $\lambda=0.1$ . We compare to a model without our training method (Standard), a model with the gradient penalty regularization proposed by [19] (GAN+GP) and a model with spectral normalization (SNGAN). To the best of our knowledge, these methods are the current state-of-the-art in terms of GAN stabilization. Fig. 7 shows that our method is stable and accurate across all settings.

Robustness to Network Architectures. To test the robustness of DFGANs against non-optimal network architectures we modified the networks in Table 1 by doubling the number of layers in both generator and discriminator. This leads to significantly worse performance in terms of FID in all cases: 46 to 135 (Standard), 33 to 111 (SNGAN), 28 to 36 (GAN+GP), and 27 to 60 (DFGAN). However, SNGAN+DF leads to good results with a FID of 27.6.

4.2 Qualitative Results

We trained DFGANs on $128\times 128$ images from two large-scale datasets: ImageNet [20] and LSUN bedrooms [24]. The network architecture is similar to the one in Table 1 with one additional layer in both networks. We trained the models for 100K iterations on LSUN and 300K iterations on ImageNet. Random samples of the models are shown in Fig. 6. In Fig. 8 we show some examples of the noise that is produced by the noise generator at different stages during training. These examples resemble the image patterns that typically appear when the generator diverges.

5 Conclusions

We have introduced a novel method to stabilize generative adversarial training that results in accurate generative models. Our method is rather general and can be applied to other GAN formulations with an average improvement in generated sample quality and variety, and training stability. Since GAN training aims at matching probability density distributions, we add random samples to both generated and real data to extend the support of the densities and thus facilitate their matching through gradient descent. We demonstrate the proposed training method on several common datasets of real images.

Acknowledgements. This work was supported by the Swiss National Science Foundation (SNSF) grant number 200021_169622. We also wish to thank Abdelhak Lemkhenter for discussions and for help with the proof of Theorem 1.

1 Influence on the Generator Gradient Norm

We compare the norm of the generator gradient with and without DF for a GAN trained with the original minimax objective and a GAN trained with the alternative generator objective $\max_{G}\log(D(G(z)))$ in Figure 1. The models were trained on CIFAR-10. We can observe the vanishing gradient phenomenon in Figure 1(a) when no distribution filtering is applied. With our proposed method the gradient norms are stable. In the case of the alternative loss in Figure 1(b) we can observe that the gradient norms are orders of magnitude higher when no distribution filtering is applied. This results in highly unstable weight updates due to the overconfident discriminator.

2 Experiments on synthetic data

We performed experiments with a standard GAN and a DFGAN using Gaussian noise on synthetic 2-D data. The generator and discriminator architectures are both MLPs consisting of three fully-connected layers with a hidden-layer size of 512. We use ReLU activations and batch-normalization ([9]) in all but the first discriminator layer and the output layers. The Adam optimizer ([10]) was used with a learning rate of $10^{-4}$ and we trained for 20K iterations. The results are shown in Figure 2. We can observe how the matching of both clean and filtered distribution leads to a better fit in the case of DFGAN.

3 Implementation Details

Noise Generator. The noise-generator architecture in all our experiments is equivalent to the generator architecture with the number of filters reduced by a factor of eight. The output of the noise-generator has a tanh activation scaled by a factor of two to allow more noise if necessary. We also experimented with a linear activation but did not find a significant difference in performance.
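The two design choices above can be written down in a few lines. This is only a sketch: the helper names and the example layer widths are hypothetical, and the actual layer configuration is the one given in the paper's architecture table.

```python
import math

NOISE_SCALE = 2.0       # tanh output scaled by a factor of two
CHANNEL_REDUCTION = 8   # noise generator uses 1/8 of the generator's filters


def noise_output_activation(pre_activations):
    """Map unbounded pre-activations to noise values in [-2, 2]."""
    return [NOISE_SCALE * math.tanh(v) for v in pre_activations]


def noise_generator_channels(generator_channels):
    """Reduce every layer's filter count by a factor of eight."""
    return [c // CHANNEL_REDUCTION for c in generator_channels]


# Hypothetical generator widths, chosen only for illustration.
print(noise_generator_channels([512, 256, 128]))
print(noise_output_activation([-5.0, 0.0, 0.5, 5.0]))
```

With this scaling the noise can reach magnitudes up to two, which is what allows more noise than an unscaled tanh would.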

GAN+GP. For the comparisons to the GAN regularizer proposed by [19], we used the same settings that their work uses in its experiments with DCGAN.

SNGAN+DF. We used the standard GAN loss (same as DCGAN) in all our experiments with models using spectral normalization. When combining SNGAN with DF we batch-normalized the noisy inputs to the discriminator.
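For readers unfamiliar with the operation, the sketch below shows what batch-normalizing a batch of noisy inputs computes: each feature is shifted and scaled to zero mean and unit variance across the batch. It assumes plain per-feature normalization without learned scale and shift and an `eps` of `1e-5`. The text does not specify these details, so they are assumptions, as are the function name and the toy batch.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) of `batch` to zero mean, unit variance.

    `batch` is a list of equally long feature lists, one per sample.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


# Hypothetical batch of three noisy 2-D inputs with very different feature scales.
noisy_inputs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_normalize(noisy_inputs)
print(normalized)
```

After normalization the scale of the added noise no longer changes the overall scale of the inputs that the discriminator sees.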

4 Qualitative Results for Experiments

We provide qualitative results for some of the ablation experiments in Figure 3 and for the robustness experiments in Figure 4. As we can see in Figure 4, none of the tested settings led to degenerate solutions in the case of DFGAN while the other methods would show failure cases in some settings.

5 Application to Progressive GAN

To test our method on a state-of-the-art GAN we applied our training method to the progressive GAN model. We used the DCGAN loss, trained for a total of 6M images and did not use label conditioning. We used fixed Gaussian noise for the distribution filtering. On CIFAR-10, progressive GAN without DF achieved an FID of 29.4. Adding DF improved the FID to 26.8. Note that the original WGAN-GP loss in the same setup only achieved an FID of 29.8.

We also trained progressive GAN+DF on higher-resolution $256\times 256$ images of LSUN bedrooms. See Figure 5 for results.

Bibliography

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[3] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[4] Soumith Chintala, Emily Denton, Martin Arjovsky, and Michael Mathieu. How to train a GAN? Tips and tricks to make GANs work. https://github.com/soumith/ganhacks, 2016.
[5] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.
