Biadversarial Variational Autoencoder

Arnaud Fickinger

arXiv:1902.03517·cs.LG·February 13, 2019

TL;DR

This paper introduces a Biadversarial Variational Autoencoder that replaces the Gaussian assumptions of the standard VAE with two adversarial networks, with the aim of modeling multimodal distributions better and avoiding the blurry images attributed to those assumptions.

Contribution

It proposes a VAE framework in which two adversarial networks optimize the encoder and decoder without Gaussian assumptions, so that complex, multimodal distributions can be represented.

Findings

01 Avoids Gaussian assumptions in the VAE

02 Targets better modeling of multimodal distributions

03 Motivated by the blurry images produced under Gaussian assumptions

Abstract

In the original version of the Variational Autoencoder, Kingma et al. assume Gaussian distributions for the approximate posterior during inference and for the output during the generative process. These assumptions are convenient computationally: for example, the parameters of a neural network can be optimized easily with the reparametrization trick, and the KL divergence between two Gaussians can be computed in closed form. However, they result in blurry images because Gaussians have difficulty representing multimodal distributions. We show that, using two adversarial networks, we can optimize the parameters without any Gaussian assumptions.

Equations

$$
\begin{aligned}
\log p_{\theta}(x) &= \log \int p_{\theta}(x,z)\,dz \\
&= \log \int \frac{p_{\theta}(x,z)\,q_{\phi}(z|x)}{q_{\phi}(z|x)}\,dz \\
&\geq \mathbb{E}_{q_{\phi}(z|x)}\big(\log p_{\theta}(x|z) + \log p(z) - \log q_{\phi}(z|x)\big) \equiv \mathrm{ELBO}
\end{aligned}
\tag{1}
$$

$$q_{\phi}(z|x) = \mathcal{N}(z \mid \mu_{z}, \sigma_{z}^{2} I) \tag{2}$$

$$\mathrm{ELBO} \equiv \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p_{\theta}(x|z)\big) + \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p(z) - \log q_{\phi}(z|x)\big) \tag{3}$$

$$\max_{\theta} \max_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathrm{ELBO} \tag{4}$$

$$\min_{\theta} \min_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log q_{\phi}(z|x) - \log p(z)\big) = \min_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} KL\big(q_{\phi}(z|x) \,\|\, p(z)\big) \tag{5}$$

$$\max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi) \equiv \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\big(1 - \mathcal{D}_{\phi}(x,z)\big) - \mathbb{E}_{z \sim p(z)}\big(\exp(-\mathcal{D}_{\phi}(x,z))\big)\Big) \tag{6}$$

$$
\begin{aligned}
&\mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\big(1 - \mathcal{D}_{\phi}(x,z)\big) - \mathbb{E}_{z \sim p(z)}\big(\exp(-\mathcal{D}_{\phi}(x,z))\big)\Big) \\
&\quad = \int \hat{p}_{data}(x)\Big(q_{\phi}(z|x)\big(1 - \mathcal{D}_{\phi}(x,z)\big) - p(z)\exp(-\mathcal{D}_{\phi}(x,z))\Big)\,dz\,dx
\end{aligned}
\tag{7}
$$

$$\forall x, z, \quad \mathcal{D}_{\phi}^{*}(x,z) = \log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big) \tag{8}$$

$$
\begin{aligned}
V(\mathcal{D}_{\phi}^{*}, \phi) &= \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(1 - \log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big)\Big) - \mathbb{E}_{z \sim p(z)}\Big(\exp\Big(-\log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big)\Big)\Big)\Big) \\
&= \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\log\Big(\frac{q_{\phi}(z|x)}{p(z)}\Big)\Big)\Big) \\
&= \mathbb{E}_{x \sim \hat{p}_{data}} KL\big(q_{\phi}(z|x) \,\|\, p(z)\big)
\end{aligned}
\tag{9}
$$

$$\min_{\phi} \max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi) \tag{10}$$

$$
\begin{aligned}
p(x|z) &= \mathcal{N}(x \mid \mu, \sigma^{2} I) \\
&= \frac{1}{(2\pi)^{n/2}\sigma^{n}} \exp\Big(-\frac{\|x-\mu\|_{2}^{2}}{2\sigma^{2}}\Big)
\end{aligned}
\tag{11}
$$

$$-\log p(x|z) = \log\big((2\pi)^{n/2}\sigma^{n}\big) + \frac{\|x-\mu\|_{2}^{2}}{2\sigma^{2}} \tag{12}$$

$$
\begin{aligned}
&\arg\max_{\theta} \arg\max_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p_{\theta}(x|z)\big) \\
={}& \arg\min_{\theta} \arg\min_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)} \mathbb{E}_{x \sim \hat{p}_{data}}\big(\log \hat{p}_{data}(x) - \log p_{\theta}(x|z)\big) \\
={}& \arg\min_{\theta} \arg\min_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)} KL\big(\hat{p}_{data}(x) \,\|\, p_{\theta}(x|z)\big)
\end{aligned}
\tag{13}
$$

$$\max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) \equiv \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\big(\mathcal{D}_{\theta,\phi}(x,z)\big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\big(\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\big)\Big) \tag{14}$$

$$
\begin{aligned}
&\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\big(\mathcal{D}_{\theta,\phi}(x,z)\big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\big(\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\big)\Big) \\
&\quad = \int q_{\phi}(z|x)\Big(\hat{p}_{data}(x)\,\mathcal{D}_{\theta,\phi}(x,z) - p_{\theta}(x|z)\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\Big)\,dx\,dz
\end{aligned}
\tag{15}
$$

$$\forall x, z, \quad \mathcal{D}_{\theta,\phi}^{*}(x,z) = 1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big) \tag{16}$$

$$
\begin{aligned}
V(\mathcal{D}_{\theta,\phi}^{*}, \theta, \phi) &= \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\Big(1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big)\Big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\Big(\exp\Big(1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big) - 1\Big)\Big)\Big) \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\Big(\log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big)\Big)\Big) \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)} KL\big(\hat{p}_{data}(x) \,\|\, p_{\theta}(x|z)\big)
\end{aligned}
\tag{17}
$$

$$\min_{\theta} \min_{\phi} \max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) \tag{18}$$

$$\min_{\theta} \min_{\phi} \Big(\max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) + \max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi)\Big) \tag{19}$$

Code & Models

Repositories

ArnaudFickinger/BAVAE (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Anomaly Detection Techniques and Applications · Adversarial Robustness in Machine Learning

Full text

Biadversarial Variational Autoencoder

Arnaud Fickinger

Department of Computer Science

Ecole Polytechnique

Palaiseau, FRANCE

[email protected]

Abstract

In the original version of the variational autoencoder (1), Kingma et al. assume Gaussian distributions for the approximate posterior during inference and for the output during the generative process. These assumptions are convenient computationally: for example, the parameters of a neural network can be optimized easily with the reparametrization trick, and the KL divergence between two Gaussians can be computed in closed form. However, they result in blurry images because Gaussians have difficulty representing multimodal distributions. We show that, using two adversarial networks, we can optimize the parameters without any Gaussian assumptions.

1 Introduction

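We want to maximize the evidence lower bound (ELBO) of the marginal likelihood $p_{\theta}(x)$. We can derive the ELBO using Jensen's inequality by marginalizing out the latent variable $z$ and introducing the approximate posterior $q_{\phi}(z|x)$:

$$
\begin{aligned}
\log p_{\theta}(x) &= \log \int p_{\theta}(x,z)\,dz \\
&= \log \int \frac{p_{\theta}(x,z)\,q_{\phi}(z|x)}{q_{\phi}(z|x)}\,dz \\
&\geq \mathbb{E}_{q_{\phi}(z|x)}\big(\log p_{\theta}(x|z) + \log p(z) - \log q_{\phi}(z|x)\big) \equiv \mathrm{ELBO}
\end{aligned}
\tag{1}
$$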
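As a sanity check on the bound, the sketch below evaluates both sides for a toy model with a binary latent variable. The numbers in `prior` and `likelihood` and all function names are made up for illustration and do not come from the paper; the only claim tested is that the ELBO never exceeds $\log p_{\theta}(x)$ and that the gap closes when $q_{\phi}(z|x)$ equals the exact posterior.

```python
import math

# Hypothetical toy model (not from the paper): binary latent z, one fixed observation x.
prior = [0.5, 0.5]        # p(z)
likelihood = [0.9, 0.2]   # p_theta(x | z), evaluated at the observed x


def log_evidence(prior, likelihood):
    """log p_theta(x) = log sum_z p(z) p_theta(x | z)."""
    return math.log(sum(pz * px for pz, px in zip(prior, likelihood)))


def elbo(prior, likelihood, q):
    """E_q[log p_theta(x | z) + log p(z) - log q(z | x)] for a discrete q."""
    return sum(qz * (math.log(px) + math.log(pz) - math.log(qz))
               for pz, px, qz in zip(prior, likelihood, q))


def exact_posterior(prior, likelihood):
    """p_theta(z | x) obtained by normalizing the joint p(z) p_theta(x | z)."""
    joint = [pz * px for pz, px in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


# Gap between log-evidence and ELBO for an arbitrary q and for the exact posterior.
gap_arbitrary = log_evidence(prior, likelihood) - elbo(prior, likelihood, [0.6, 0.4])
gap_exact = log_evidence(prior, likelihood) - elbo(
    prior, likelihood, exact_posterior(prior, likelihood))
print(gap_arbitrary, gap_exact)
```

The first gap is strictly positive and the second vanishes up to floating-point error, which is the usual statement that the slack in the bound is the KL divergence between $q_{\phi}(z|x)$ and the true posterior.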

2 Inference

Many works on variational autoencoders assume a Gaussian distribution for the approximate posterior:

$$q_{\phi}(z|x) = \mathcal{N}(z \mid \mu_{z}, \sigma_{z}^{2} I) \tag{2}$$

where $\mu_{z}$ and $\sigma_{z}$ are neural network functions.

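This is convenient for computation but very restrictive for $z$. We introduce an adversarial network that optimizes the parameters of the encoder without any such restrictive assumption. To do that, rearrange eq. (1):

$$\mathrm{ELBO} \equiv \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p_{\theta}(x|z)\big) + \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p(z) - \log q_{\phi}(z|x)\big) \tag{3}$$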

The objective is then:

$$\max_{\theta} \max_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathrm{ELBO} \tag{4}$$

where $\phi$ denotes the parameters of the encoder and $\theta$ denotes the parameters of the decoder.

Rewrite the second term of the ELBO in eq. (3) to bring out a Kullback-Leibler (KL) divergence:

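$$\min_{\theta} \min_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log q_{\phi}(z|x) - \log p(z)\big) = \min_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} KL\big(q_{\phi}(z|x) \,\|\, p(z)\big) \tag{5}$$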

This term corresponds to the KL divergence between the approximate posterior $q_{\phi}(z|x)$ and the prior $p(z)$. Note that it is the reverse KL divergence, i.e. the log-ratio between the two distributions is weighted by the approximate posterior, which is a better option for learning real modes in the case of a multimodal distribution. Inspired by (2), we define a network with an objective that differs slightly from the original adversarial network (3), so that the associated generator learns to minimize the reverse KL divergence instead of the Jensen-Shannon divergence. In so doing we are able to optimize the parameters without making any parametric assumption on the posterior. Introduce the network $\mathcal{D}_{\phi}:X\times Z\longrightarrow\mathbb{R}$ with the following objective:

$$\max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi) \equiv \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\big(1 - \mathcal{D}_{\phi}(x,z)\big) - \mathbb{E}_{z \sim p(z)}\big(\exp(-\mathcal{D}_{\phi}(x,z))\big)\Big) \tag{6}$$

where the encoder parameters $\phi$ are held fixed.

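Following (3), write the value function as an integral to find the optimal value of $\mathcal{D}_{\phi}$:

$$
\begin{aligned}
&\mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\big(1 - \mathcal{D}_{\phi}(x,z)\big) - \mathbb{E}_{z \sim p(z)}\big(\exp(-\mathcal{D}_{\phi}(x,z))\big)\Big) \\
&\quad = \int \hat{p}_{data}(x)\Big(q_{\phi}(z|x)\big(1 - \mathcal{D}_{\phi}(x,z)\big) - p(z)\exp(-\mathcal{D}_{\phi}(x,z))\Big)\,dz\,dx
\end{aligned}
\tag{7}
$$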

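Given a pair $(a,b)$ of positive reals, the function $d\in\mathbb{R}\mapsto a(1-d)-b\exp(-d)$ is concave and reaches its maximum at $d^{*}=\log(\frac{b}{a})$, where its derivative $-a+b\exp(-d)$ vanishes. Taking $a=q_{\phi}(z|x)$ and $b=p(z)$, the integrand is maximized pointwise, and hence the integral is maximized, if:

$$\forall x, z, \quad \mathcal{D}_{\phi}^{*}(x,z) = \log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big) \tag{8}$$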
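The pointwise maximizer can be checked numerically. The sketch below compares the closed form $d^{*}=\log(b/a)$ with a brute-force grid search; the values of `a` and `b`, the grid bounds, and the helper names are arbitrary choices made for illustration.

```python
import math


def reverse_kl_integrand(d, a, b):
    """Pointwise integrand a * (1 - d) - b * exp(-d), with a = q(z|x) and b = p(z)."""
    return a * (1.0 - d) - b * math.exp(-d)


def argmax_on_grid(f, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force argmax of f on an evenly spaced grid over [lo, hi]."""
    best_d, best_v = lo, f(lo)
    for i in range(1, steps):
        d = lo + (hi - lo) * i / (steps - 1)
        v = f(d)
        if v > best_v:
            best_d, best_v = d, v
    return best_d


a, b = 0.3, 0.7                      # arbitrary positive values
d_closed_form = math.log(b / a)      # the maximizer stated in the text
d_grid = argmax_on_grid(lambda d: reverse_kl_integrand(d, a, b))
print(d_closed_form, d_grid)
```

Because the integrand is concave in `d`, the grid maximizer lies within one grid step of the closed-form value.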

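Substituting eq. (8) into eq. (6), we show that the optimal value function $V(\mathcal{D}^{*}_{\phi},\phi)$ reached by the discriminator, the generator being fixed, is the KL divergence in eq. (5):

$$
\begin{aligned}
V(\mathcal{D}_{\phi}^{*}, \phi) &= \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(1 - \log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big)\Big) - \mathbb{E}_{z \sim p(z)}\Big(\exp\Big(-\log\Big(\frac{p(z)}{q_{\phi}(z|x)}\Big)\Big)\Big)\Big) \\
&= \mathbb{E}_{x \sim \hat{p}_{data}}\Big(\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\log\Big(\frac{q_{\phi}(z|x)}{p(z)}\Big)\Big)\Big) \\
&= \mathbb{E}_{x \sim \hat{p}_{data}} KL\big(q_{\phi}(z|x) \,\|\, p(z)\big)
\end{aligned}
\tag{9}
$$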
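The same identity can be verified on discrete distributions, where the expectations become finite sums. In the sketch below, `q` and `p` are made-up three-state stand-ins for $q_{\phi}(z|x)$ and $p(z)$ at one fixed $x$, and `value` is a hypothetical discrete analogue of the value function in eq. (6).

```python
import math

# Hypothetical discrete stand-ins for a single fixed x.
q = [0.7, 0.2, 0.1]   # plays the role of q_phi(z | x)
p = [0.2, 0.3, 0.5]   # plays the role of the prior p(z)


def value(d, q, p):
    """Discrete analogue of eq. (6): E_q[1 - D] - E_p[exp(-D)]."""
    return (sum(qi * (1.0 - di) for qi, di in zip(q, d))
            - sum(pi * math.exp(-di) for pi, di in zip(p, d)))


def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


d_star = [math.log(pi / qi) for qi, pi in zip(q, p)]   # optimal discriminator, eq. (8)
d_perturbed = [di + 0.3 for di in d_star]              # any other discriminator

print(value(d_star, q, p), kl(q, p), value(d_perturbed, q, p))
```

At `d_star` the value equals the reverse KL divergence, and any other discriminator gives a smaller value, so in this toy setting a suboptimal discriminator underestimates the divergence that the encoder is asked to minimize.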

In so doing we can optimize the second term of the ELBO with a minimax game with value function $V(\mathcal{D}_{\phi},\phi)$ :

$$\min_{\phi} \max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi) \tag{10}$$
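With an optimal discriminator, the outer minimization in this game is a reverse KL minimization, in which the log-ratio is weighted by the distribution being optimized. The toy comparison below uses made-up three-state distributions (a bimodal `target` and two candidate approximations) to illustrate the commonly cited consequence: the reverse KL prefers a candidate that commits to a real mode, whereas the direct KL prefers one that spreads mass over the low-probability gap. It is a generic illustration of the two divergences, not a reproduction of the paper's setting.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


target = [0.495, 0.01, 0.495]    # hypothetical bimodal target with a low-mass gap
one_mode = [0.98, 0.01, 0.01]    # candidate that commits to a single mode
spread = [1 / 3, 1 / 3, 1 / 3]   # candidate that also covers the gap

# Reverse KL: the candidate comes first, so it weights the log-ratio.
reverse = {"one_mode": kl(one_mode, target), "spread": kl(spread, target)}
# Direct KL: the target comes first.
forward = {"one_mode": kl(target, one_mode), "spread": kl(target, spread)}

print(reverse)
print(forward)
```

In this example the reverse KL ranks `one_mode` ahead of `spread`, while the direct KL ranks them the other way round.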

3 Generative process

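Many works on variational autoencoders also assume a Gaussian distribution for the output:

$$
\begin{aligned}
p(x|z) &= \mathcal{N}(x \mid \mu, \sigma^{2} I) \\
&= \frac{1}{(2\pi)^{n/2}\sigma^{n}} \exp\Big(-\frac{\|x-\mu\|_{2}^{2}}{2\sigma^{2}}\Big)
\end{aligned}
\tag{11}
$$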

where $\mu$ is a neural network function and $\sigma^{2}$ is a hyperparameter.

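The negative log-likelihood of this distribution is an affine function of the squared L2 norm of $x-\mu$, hence we often encounter an L2 reconstruction term in works on variational autoencoders:

$$-\log p(x|z) = \log\big((2\pi)^{n/2}\sigma^{n}\big) + \frac{\|x-\mu\|_{2}^{2}}{2\sigma^{2}} \tag{12}$$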
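The link between the Gaussian output and the L2 reconstruction term can be made concrete. The sketch below assumes an isotropic Gaussian in $n$ dimensions, whose log-normalizer is $\frac{n}{2}\log(2\pi)+n\log\sigma$, and checks that the negative log-likelihoods of two inputs differ only through their squared errors. The vectors, `sigma`, and function names are illustrative choices.

```python
import math


def squared_error(x, mu):
    """Squared L2 distance between two vectors given as lists."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def gaussian_nll(x, mu, sigma):
    """-log N(x | mu, sigma^2 I): a constant plus squared_error / (2 sigma^2)."""
    n = len(x)
    constant = 0.5 * n * math.log(2.0 * math.pi) + n * math.log(sigma)
    return constant + squared_error(x, mu) / (2.0 * sigma ** 2)


def univariate_nll(xi, mi, sigma):
    """-log N(xi | mi, sigma^2) for a single coordinate, used as a cross-check."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (xi - mi) ** 2 / (2.0 * sigma ** 2)


mu = [0.0, 1.0, -1.0]     # decoder mean (made-up values)
x_a = [0.5, 1.0, -2.0]
x_b = [0.0, 0.0, 0.0]
sigma = 0.5               # fixed hyperparameter

nll_gap = gaussian_nll(x_a, mu, sigma) - gaussian_nll(x_b, mu, sigma)
scaled_error_gap = (squared_error(x_a, mu) - squared_error(x_b, mu)) / (2.0 * sigma ** 2)
print(nll_gap, scaled_error_gap)
```

Since the additive constant does not depend on $x$ or $\mu$, minimizing this negative log-likelihood over $\mu$ is the same as minimizing the squared error, with $\sigma$ acting only as a weight against the other term of the ELBO.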

Rearrange the first term of the objective in eq. (3) to bring out a direct KL divergence:

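$$
\begin{aligned}
&\arg\max_{\theta} \arg\max_{\phi} \mathbb{E}_{x \sim \hat{p}_{data}} \mathbb{E}_{z \sim q_{\phi}(z|x)}\big(\log p_{\theta}(x|z)\big) \\
={}& \arg\min_{\theta} \arg\min_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)} \mathbb{E}_{x \sim \hat{p}_{data}}\big(\log \hat{p}_{data}(x) - \log p_{\theta}(x|z)\big) \\
={}& \arg\min_{\theta} \arg\min_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)} KL\big(\hat{p}_{data}(x) \,\|\, p_{\theta}(x|z)\big)
\end{aligned}
\tag{13}
$$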

This time we choose an adversarial objective so that the associated generator learns to minimize the direct KL divergence. Introduce the network $\mathcal{D}_{\theta,\phi}:X\times Z\longrightarrow\mathbb{R}$ with the following objective:

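$$\max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) \equiv \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\big(\mathcal{D}_{\theta,\phi}(x,z)\big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\big(\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\big)\Big) \tag{14}$$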

where the parameters $\theta$ and $\phi$ are fixed.

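Write the value function as an integral to find the optimal value of $\mathcal{D}_{\theta,\phi}$:

$$
\begin{aligned}
&\mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\big(\mathcal{D}_{\theta,\phi}(x,z)\big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\big(\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\big)\Big) \\
&\quad = \int q_{\phi}(z|x)\Big(\hat{p}_{data}(x)\,\mathcal{D}_{\theta,\phi}(x,z) - p_{\theta}(x|z)\exp(\mathcal{D}_{\theta,\phi}(x,z) - 1)\Big)\,dx\,dz
\end{aligned}
\tag{15}
$$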

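Given a pair $(a,b)$ of positive reals, the function $d\in\mathbb{R}\mapsto ad-b\exp(d-1)$ is concave and reaches its maximum at $d^{*}=1+\log(\frac{a}{b})$, where its derivative $a-b\exp(d-1)$ vanishes. Taking $a=\hat{p}_{data}(x)$ and $b=p_{\theta}(x|z)$, the integral is maximized if:

$$\forall x, z, \quad \mathcal{D}_{\theta,\phi}^{*}(x,z) = 1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big) \tag{16}$$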

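Substituting eq. (16) into eq. (14), we show that the optimal value function $V(\mathcal{D}^{*}_{\theta,\phi},\theta,\phi)$ reached by the discriminator, the generator being fixed, is the direct KL divergence in eq. (13):

$$
\begin{aligned}
V(\mathcal{D}_{\theta,\phi}^{*}, \theta, \phi) &= \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\Big(1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big)\Big) - \mathbb{E}_{x \sim p_{\theta}(x|z)}\Big(\exp\Big(1 + \log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big) - 1\Big)\Big)\Big) \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big(\mathbb{E}_{x \sim \hat{p}_{data}}\Big(\log\Big(\frac{\hat{p}_{data}(x)}{p_{\theta}(x|z)}\Big)\Big)\Big) \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)} KL\big(\hat{p}_{data}(x) \,\|\, p_{\theta}(x|z)\big)
\end{aligned}
\tag{17}
$$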
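A discrete check of this result, under made-up distributions: `p_data` and `p_model` stand in for $\hat{p}_{data}(x)$ and $p_{\theta}(x|z)$ at one fixed $z$, and `value` is a hypothetical discrete analogue of eq. (14). Random perturbations of the optimal discriminator (seeded for reproducibility) are used to confirm that it is a maximizer.

```python
import math
import random

# Hypothetical discrete stand-ins for a single fixed z.
p_data = [0.6, 0.3, 0.1]    # plays the role of p_hat_data(x)
p_model = [0.2, 0.5, 0.3]   # plays the role of p_theta(x | z)


def value(d, p_data, p_model):
    """Discrete analogue of eq. (14): E_data[D] - E_model[exp(D - 1)]."""
    return (sum(a * di for a, di in zip(p_data, d))
            - sum(b * math.exp(di - 1.0) for b, di in zip(p_model, d)))


def kl(a, b):
    """KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


# Optimal discriminator from eq. (16).
d_star = [1.0 + math.log(a / b) for a, b in zip(p_data, p_model)]
optimal_value = value(d_star, p_data, p_model)

# Values reached by randomly perturbed discriminators.
rng = random.Random(0)
perturbed_values = [
    value([di + rng.uniform(-1.0, 1.0) for di in d_star], p_data, p_model)
    for _ in range(100)
]

print(optimal_value, kl(p_data, p_model), max(perturbed_values))
```

The optimal value matches the direct KL divergence, and none of the perturbed discriminators exceeds it.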

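In so doing we can optimize the first term of the ELBO with a minimax game with value function $V(\mathcal{D}_{\theta,\phi},\theta,\phi)$:

$$\min_{\theta} \min_{\phi} \max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) \tag{18}$$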

Finally we have transformed the optimization of the ELBO into a minimax game involving two discriminators:

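$$\min_{\theta} \min_{\phi} \Big(\max_{\mathcal{D}_{\theta,\phi}} V(\mathcal{D}_{\theta,\phi}, \theta, \phi) + \max_{\mathcal{D}_{\phi}} V(\mathcal{D}_{\phi}, \phi)\Big) \tag{19}$$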
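In practice the expectations in the two value functions would be estimated from samples. The sketch below shows one plausible way to form such Monte Carlo estimates from lists of discriminator outputs. All function and argument names are hypothetical, the code is not taken from the BAVAE repository, and how the samples and discriminator outputs are produced is left outside the sketch.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of floats."""
    return sum(values) / len(values)


def posterior_value(d_on_posterior, d_on_prior):
    """Estimate of eq. (6): mean(1 - D) over z ~ q_phi(z|x) minus mean(exp(-D)) over z ~ p(z)."""
    return (mean([1.0 - d for d in d_on_posterior])
            - mean([math.exp(-d) for d in d_on_prior]))


def likelihood_value(d_on_data, d_on_model):
    """Estimate of eq. (14): mean(D) over x ~ p_hat_data minus mean(exp(D - 1)) over x ~ p_theta(x|z)."""
    return (mean(list(d_on_data))
            - mean([math.exp(d - 1.0) for d in d_on_model]))


def combined_value(d_on_data, d_on_model, d_on_posterior, d_on_prior):
    """Sum of the two value functions, as in the final two-discriminator game."""
    return (likelihood_value(d_on_data, d_on_model)
            + posterior_value(d_on_posterior, d_on_prior))


# Made-up discriminator outputs, just to exercise the functions.
example = combined_value(
    d_on_data=[1.2, 0.8], d_on_model=[0.1, -0.3],
    d_on_posterior=[0.4, 0.0], d_on_prior=[0.9, 1.1],
)
print(example)
```

Under this reading of the game, each discriminator would take ascent steps on its own value function while the encoder and decoder take descent steps on the sum, which requires the samples from $q_{\phi}(z|x)$ and $p_{\theta}(x|z)$ to be differentiable with respect to $\phi$ and $\theta$.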

4 Implementation

The model is implemented with PyTorch. The implementation is available here:

https://github.com/ArnaudFickinger/BAVAE.

Bibliography

(1) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv e-prints, arXiv:1312.6114, December 2013.
(2) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv e-prints, arXiv:1606.00709, June 2016.
(3) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv e-prints, arXiv:1406.2661, June 2014.