The Variational InfoMax AutoEncoder

Vincenzo Crescimanna; Bruce Graham

arXiv:1905.10549·cs.LG·November 10, 2020


TL;DR

This paper introduces the Variational InfoMax (VIM), a new learning objective for VAEs that optimizes both inference and generative models while controlling network capacity to improve informativeness and robustness.

Contribution

The paper proposes the VIM objective, which simultaneously learns inference and generative models with explicit capacity control, addressing limitations of the ELBO in VAEs.

Findings

01 VIM improves the informativeness of the generator.

02 VIM provides explicit capacity estimation.

03 VIM enhances network robustness.

Abstract

The Variational AutoEncoder (VAE) simultaneously learns an inference model and a generative model, but only one of the two can be learned optimally. This behaviour is associated with the ELBO learning objective, which can be optimised by a non-informative generator. To solve this issue, we provide a learning objective that learns a maximally informative generator while keeping the network capacity bounded: the Variational InfoMax (VIM). The contribution of the VIM derivation is twofold: an objective that learns both an optimal inference model and an optimal generative model, and an explicit definition of the network capacity, which gives an estimate of the network robustness.

Tables

Table I: NLL (MNIST, Omniglot) and FID (CIFAR10, CelebA) for generated samples (smaller is better)

Method	MNIST (NLL)	Omniglot (NLL)	CIFAR10 (FID)	CelebA (FID)
VAE	1158	1224	168	82
$\beta_{H}$-VAE	1113	1254	262	-
$\beta_{A}$-VAE	1123	1228	174	89
VIMAE-n	1169	1190	103	56
VIMAE-l	1171	1223	104	55

Table II: Reconstruction accuracy, $\|\cdot\|_{2}$ over 100 samples (smaller is better)

Method	MNIST	Omniglot	CIFAR10	CelebA
VAE	0.51	0.75	8.29	17.29
$\beta_{H}$-VAE	0.62	0.75	9.8	-
$\beta_{A}$-VAE	0.5	5.7	89	17.6
VIMAE-n	0.47	0.75	4.74	16.65
VIMAE-l	0.48	0.76	4.85	16.74

Table III: Semi-supervised classification, CIFAR10, accuracy (%)

Method	$\nu = 0$	$\nu = \mathcal{N}(0, 0.3^{2})$	$\nu = \mathcal{B}(0.2)$
VAE	30	25	16
$\beta_{H}$-VAE	29	26	19
$\beta_{A}$-VAE	31	31	18
VIMAE-n	29	28	23
VIMAE-l	32	34	23

Table IV: Semi-supervised classification, MNIST, accuracy (%)

Method	$\nu = 0$	$\nu = \mathcal{N}(0, 0.2^{2})$	$\nu = \mathcal{N}(0, 0.4^{2})$	$\nu = \mathcal{B}(0.2)$	$\nu = \mathcal{B}(0.5)$
VAE	80	77	70	72	52
$\beta_{H}$-VAE	92	86	82	91	84
$\beta_{A}$-VAE	93	66	13	85	65
VIMAE-n	93	92	86	92	86
VIMAE-l	93	92	88	92	87

Table V: Semi-supervised classification, Omniglot (random sampling: 20%), accuracy (%)

Method	$\nu = 0$	$\nu = \mathcal{N}(0, 0.2^{2})$	$\nu = \mathcal{N}(0, 0.4^{2})$	$\nu = \mathcal{B}(0.2)$	$\nu = \mathcal{B}(0.5)$
VAE	22	22	17	22	16
$\beta_{H}$-VAE	21	21	22	19	17
$\beta_{A}$-VAE	22	22	21	21	24
VIMAE-n	22	23	24	22	22
VIMAE-l	24	23	20	23	22

Equations

$$D_{KL}(p(x)||q(x))=\int\log\left(\frac{p(x)}{q(x)}\right)p(x)\,dx$$

$$I(X,Z)=D_{KL}(p(x,z)||p(x)p(z))$$

$$C(X,Z)=\sup_{p(z)\in\mathcal{P}}I(X,Z)$$

$$ELBO_{\theta,\phi}=\mathbb{E}_{p(x)}\big[\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]-D_{KL}(q_{\phi}(z|x)||p(z))\big] \tag{2}$$

$$I_{\theta}(X,Z)=h_{\theta}(X)-h_{\theta}(X|Z) \tag{3}$$

$$h_{\theta,\phi}(X|Z)=h_{\phi}(X|Z)+D_{KL}(q_{\phi}(z|x)||p_{\theta}(z|x))\quad\text{s.t.}\quad q_{\phi}(z)=p(z) \tag{4}$$

$$VIM_{\theta,\phi}=h_{\theta,\phi}(X|Z)-\lambda D(q_{\phi}(z)||p(z)),\quad\lambda>0 \tag{5}$$

$$VIM_{\theta,\phi}=-D_{KL}(p(x)||p_{\theta}(x))-D_{KL}(q_{\phi}(z|x)||p_{\theta}(z|x))-(\lambda-1)D_{KL}(q_{\phi}(z)||p(z))+I_{\phi}(X,Z) \tag{6}$$

$$C_{\theta}(X,Z)=\sup_{\theta,\,p(z)\in\mathcal{P}}I_{\theta}(X,Z) \tag{7}$$

$$\max_{\theta}I_{\theta}(X,Z)\quad\text{s.t.}\quad C_{\theta}(X,Z)=h(Z) \tag{8}$$

$$\mathbb{E}_{q(z|x)}[-\log p_{\theta}(x|z)]-\beta\,\mathbb{E}_{x}[D_{KL}(q(z|x)||p(z))],\quad\beta>1 \tag{9}$$

$$\mathbb{E}_{q(z|x)}[-\log p_{\theta}(x|z)]-\beta\,\big|C-\mathbb{E}_{x}[D_{KL}(q(z|x)||p(z))]\big| \tag{10}$$

$$\max_{Z}I(Z,T)-\beta I(X,Z) \tag{11}$$

$$MMD(q(z),p(z))=\sup_{f:\|f\|_{\mathcal{H}_{k}}\leq 1}\mathbb{E}_{p(z)}[f(Z)]-\mathbb{E}_{q(z)}[f(Z)]$$


Full text

The Variational InfoMax AutoEncoder

Vincenzo Crescimanna, Bruce Graham

Department of Computer Science

University of Stirling

Stirling, UK

{vincenzo.crescimanna1, bruce.graham}@stir.ac.uk


I Introduction

A common assumption in machine learning is that any visible data $x\in\mathcal{X}$ is completely described by some generative factor $o$, living in a smaller hidden space $\mathcal{O}$, i.e. $x=g(o)$ with $g$ a (possibly stochastic) generative function. The aim of unsupervised representation learning is to find a representation $z$ of the generative factor $o$, living in a known space $\mathcal{Z}$, that describes the visible data $x$ as well as $o$ does. The relevance of this task is twofold: since the learnt compact representation $z$ is task agnostic, it can be used as input for networks performing different tasks (generalisation property) [1]; and such a representation makes it possible to interpret what the network is learning in its hidden layers [2].

Many models $f_{\phi}:\mathcal{X}\to\mathcal{Z}$, parametrising an inference distribution $q_{\phi}(z|x)$, have been proposed [3, 4, 5, 6]. More recently, it was proposed to solve this problem by considering a dual problem: define $z$ a priori and find a generator map $g_{\theta}$ such that, for any $z$, $g_{\theta}(z)$ is an element of $\mathcal{X}$. In particular, two families of probabilistic generative models have become dominant: the Variational AutoEncoder (VAE) [7, 8] and the Generative Adversarial Network (GAN) [9]. The common idea of the two approaches is that a good generator $p_{\theta}(x|z)$ is one able to generate data as close as possible to the visible data, i.e. one for which, with respect to a certain metric $D$, the distance between the marginal $p_{\theta}(x)=\mathbb{E}_{p(z)}[p_{\theta}(x|z)]$ and the visible distribution $p_{D}(x)$ is minimal.

In this manuscript we restrict our attention to the VAE model since, by its architecture, it is the only one of the two that learns an inference model $q_{\phi}(z|x)$, and its learnt representation, possibly from different datasets, can be used as input for networks performing different tasks [10, 11]. Although the VAE is, thanks to its training robustness and generally good generative performance, the most popular representation learning model, in particular cases it suffers from the uninformative representation issue: the representation does not separate out the generative factors, and the generator model relies on the information stored in the weights.

Following the direction suggested in [12, 13], we propose an information theoretic analysis of the VAE. This description leads us to two observations: it is possible to learn both an informative representation and a generative model, and it is necessary to bound the network capacity in order to have a generator that does not rely on the weights and is therefore more robust to noise [14]. In light of these two observations, we suggest optimising the VAE according to the Variational InfoMax (VIM), a variational objective that is a lower bound of the theoretical principle Capacity-Constrained InfoMax (CCIM), which ensures that a maximally informative generator is learnt while the network capacity remains bounded.

The theoretical deductions, namely that inference and generative tasks are not orthogonal and that a high capacity network, although more informative, is prone to overfit, are confirmed by the computational experiments, where we compare the principal variants of the classic VAE [12], [15] with two autoencoders of different network capacity, both optimising VIM.

We conclude this section by summarising the contributions of the paper in the following points:

• derivation of a variational lower bound for the maximal mutual information of a generative model belonging to a certain family, see (5);

• proposal of a new learning principle for unsupervised models, the Capacity-Constrained InfoMax, see (8), which allows a good representation to be learnt while maintaining optimal generative performance;

• highlighting of the role of the latent entropy as a bound on the network capacity, see (8);

• the observation that a small capacity network is more robust to noise and is therefore associated with a better inference model, see the experiment section.

The work is organised as follows: in the second section we briefly describe the VAE and its variants; in the third and fourth sections we describe the Variational InfoMax method and related work. We conclude the paper with the experimental results and the final observations.

II Background

II-A Notation and preliminary definitions

We use calligraphic letters (e.g. $\mathcal{X}$) for sets, capital letters (e.g. $X$) for random variables, and lower case letters (e.g. $x$) for their samples. With an abuse of notation, we denote both the probability and the corresponding density with lower case letters (e.g. $p(x)$).

KL divergence

Given two probability distributions $p(x)$ and $q(x)$, the Kullback-Leibler (KL) divergence

$$D_{KL}(p(x)||q(x))=\int\log\left(\frac{p(x)}{q(x)}\right)p(x)\,dx$$

is an (intuitive) measure of the distance between the distributions $p$ and $q$.
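As a small numerical illustration, the sketch below evaluates the discrete analogue of the KL divergence, $\sum_{x}p(x)\log(p(x)/q(x))$, on two hand-picked distributions. The function name `kl_divergence` and the example values are assumptions made for illustration only. The two directions give different values, which is why the divergence is a distance only in the intuitive sense.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example distributions over two outcomes.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # D_KL(q || p), a different value
print(kl_divergence(p, p))  # 0.0 when the two distributions coincide
```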

Mutual Information and Capacity

Consider a channel $Z\to X$ with $X$ and $Z$ random variables, jointly distributed according to $p(x,z)$ and with marginals $p(x)$ and $p(z)$. The mutual information

$$I(X,Z)=D_{KL}(p(x,z)||p(x)p(z))$$

is a measure of the reduction of uncertainty in $X$ due to the knowledge of $Z$, and the capacity

$$C(X,Z)=\sup_{p(z)\in\mathcal{P}}I(X,Z)$$

is the maximal information that can be shared for a fixed generator $p(x|z)$.
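To make the two definitions concrete, the following sketch computes the mutual information of a binary symmetric channel and approximates its capacity by a grid search over the input distribution $p(z)$. The channel, the flip probability `eps`, the grid, and the function name `mutual_information` are all assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def mutual_information(p_z, p_x_given_z):
    """I(X, Z) in nats; p_x_given_z[i, j] = p(x = j | z = i)."""
    p_z = np.asarray(p_z, dtype=float)
    joint = p_z[:, None] * p_x_given_z       # joint p(x, z)
    p_x = joint.sum(axis=0)                  # marginal p(x)
    indep = p_z[:, None] * p_x[None, :]      # product of marginals p(x) p(z)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

# Hypothetical fixed generator p(x|z): a binary symmetric channel.
eps = 0.1
channel = np.array([[1 - eps, eps],
                    [eps, 1 - eps]])

# Capacity: supremum of I(X, Z) over the input distributions p(z) = (a, 1 - a).
grid = np.linspace(0.01, 0.99, 99)
infos = [mutual_information([a, 1 - a], channel) for a in grid]
best_a = grid[int(np.argmax(infos))]
capacity = max(infos)

print(best_a, capacity)
```

For this channel the search selects the uniform input, and the value agrees with the known closed form $\log 2-H_{b}(\epsilon)$, with $H_{b}$ the binary entropy.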

II-B Variational autoencoder

From now on let us assume that the unknown distribution of the data $p(x)$ coincides with the empirical one $p_{D}(x)$, and that the distribution of the latent representation $p(z)$ is known. In this context the VAE is a model solving the following optimisation problem: find the generative model $p_{\theta}(x,z)\in\mathcal{P}_{\theta}$, specified by the parameters $\theta$ of the associated neural network, maximising the ELBO objective

$$ELBO_{\theta,\phi}=\mathbb{E}_{p(x)}\big[\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]-D_{KL}(q_{\phi}(z|x)||p(z))\big] \tag{2}$$

a lower bound of the intractable marginal likelihood $\mathbb{E}_{p(x)}[\log p_{\theta}(x)]$. The ELBO objective is optimised by a regularised autoencoder, with encoder and decoder parametrising, respectively, the inference and generative distributions $q_{\phi}(z|x)$ and $p_{\theta}(x|z)$, with $\phi\in\Phi$, $\theta\in\Theta$, and with regulariser defined by the rate term $\mathbb{E}_{p(x)}[D_{KL}(q_{\phi}(z|x)||p(z))]$, an upper bound of the encoding information $I_{\phi}(Z,X)=\mathbb{E}_{p(x)}[D_{KL}(q_{\phi}(z|x)||q_{\phi}(z))]$.
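A minimal NumPy sketch of a single-sample Monte Carlo estimate of the ELBO (2) is given below. It assumes a diagonal Gaussian encoder, a standard Normal prior, and a unit-variance Gaussian decoder; the linear decoder `W`, the random inputs, and all function names are hypothetical stand-ins rather than the architecture used in the paper.

```python
import numpy as np

def gaussian_rate(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I)), one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

def gaussian_log_likelihood(x, x_mean):
    """log p(x | z) for a unit-variance Gaussian decoder, one value per sample."""
    d = x.shape[1]
    return -0.5 * np.sum((x - x_mean) ** 2, axis=1) - 0.5 * d * np.log(2 * np.pi)

def elbo(x, x_mean, mu, log_var):
    """Batch average of reconstruction log-likelihood minus rate, as in (2)."""
    return float(np.mean(gaussian_log_likelihood(x, x_mean) - gaussian_rate(mu, log_var)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # toy batch of 4 inputs with 3 features
mu = rng.normal(size=(4, 2))           # hypothetical encoder means
log_var = np.zeros((4, 2))             # hypothetical encoder log-variances
W = rng.normal(size=(2, 3))            # hypothetical linear decoder

# Reparameterised sample z ~ q(z|x), then the decoder mean.
z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
x_mean = z @ W

print(elbo(x, x_mean, mu, log_var))
```

The rate term vanishes only when the encoder output equals the prior for every input, which is the non-informative case discussed in the text.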

II-C Uninformative representation issue

The generator optimising the ELBO is the one whose marginal $p_{\theta}(x)$ minimises the divergence $D_{KL}(p_{\theta}(x)||p(x))$, a quantity that is independent of the inference distribution $q_{\phi}(\cdot|x)$ and of the hidden representation $z$. This means that optimising the ELBO objective does not guarantee a useful inference or generative model; indeed, with a very powerful generator model, the following catastrophic scenarios are not rare:

• useless generative model: for any representation $z$, a sample from $p(x)$ is generated, since $p_{\theta}(x|z)=p_{\theta}(x)$, i.e. the information about the generated variable $X$ comes from the weights $\theta$;

• uninterpretable representation: it is impossible to identify the generative factors in the latent space, since the learned representations are independent of the visible data, i.e. $q_{\phi}(z|x)=q_{\phi}(z)$.

Both scenarios are associated with null information between $X$ and $Z$, respectively $I_{\theta}(X,Z)$ and $I_{\phi}(X;Z)$. Observing that, by the Data Processing Inequality [16], $I(g_{\theta}(Z),Z)\leq I(Z,X)$, in the next section we derive a variational objective that learns a maximally informative generator.

III The Model

III-A The Variational InfoMax

Assuming that the distribution of the random variable $O$ is known and that $p(z)=p(o)$, the InfoMax objective is defined as: find the joint distribution $p_{\theta}(x,z)\in\mathcal{P}_{\theta}:=\{p_{\theta}(x,z):\mathbb{E}_{p(z)}[p_{\theta}(x|z)]=p(x),\quad\mathbb{E}_{p(x)}[p_{\theta}(z|x)]=p(z)\}$ maximising the mutual information $I_{\theta}(X,Z)=D_{KL}(p_{\theta}(x,z)||p(x)p(z))$, i.e. find $\theta^{*}\in\Theta$ s.t. $I_{\theta^{*}}\geq I_{\theta}$ for any $\theta\in\Theta$.

Since the definition via KL divergence is computationally intractable, it is necessary to rewrite the mutual information as

$$I_{\theta}(X,Z)=h_{\theta}(X)-h_{\theta}(X|Z) \tag{3}$$

where $h_{\theta}(X)=-\mathbb{E}_{p_{\theta}(x)}[\log p_{\theta}(x)]$ is the entropy of $X$, and $h_{\theta}(X|Z)=-\mathbb{E}_{p_{\theta}(x,z)}[\log p_{\theta}(x|z)]$ is the conditional entropy of $X$ given $Z$. Since $p_{\theta}(x,z)\in\mathcal{P}_{\theta}$, the entropy $h_{\theta}(X)=h(X)$ is constant, and in order to maximise the mutual information it is sufficient to minimise the conditional entropy.

Excluding some special cases [17], minimising the conditional entropy $h_{\theta}(X|Z)$ is infeasible, so it is necessary to consider an associated variational problem: for any $q_{\phi}(z|x)$ such that $q_{\phi}(z)=p(z)$, learn the generative model $p_{\theta}(x|z)$ minimising the reconstruction accuracy term $h_{\theta,\phi}(X|Z)=\mathbb{E}_{p(x)}[\mathbb{E}_{q_{\phi}(z|x)}[\log(p_{\theta}(x|z))]]$. Indeed, the cross-entropy

$$h_{\theta,\phi}(X|Z)=h_{\phi}(X|Z)+D_{KL}(q_{\phi}(z|x)||p_{\theta}(z|x))\quad\text{s.t.}\quad q_{\phi}(z)=p(z) \tag{4}$$

is minimised when $q_{\phi}(z|x)=p_{\theta}(z|x)=p_{\theta^{*}}(z|x)$.

Unfortunately, the objective in (4) is still infeasible to compute, because it requires that $q_{\phi}(z)=p(z)$, whereas, owing to the butterfly architecture of the autoencoder, $q_{\phi}(z)$ tends to be uniformly distributed on the space $\mathcal{Z}$. For this reason, we have to consider the following relaxed form:

$$VIM_{\theta,\phi}=h_{\theta,\phi}(X|Z)-\lambda D(q_{\phi}(z)||p(z)),\quad\lambda>0 \tag{5}$$

where the term $D(q_{\phi}(z)||p(z))$ is introduced to encourage the empirical distribution $q_{\phi}(z)$ to be close, according to the metric $D$, to $p(z)$. In the following we assume $D=D_{KL}$ and, in order to avoid any confusion, the variational autoencoder trained by maximising (5) will be dubbed VIMAE.
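The difference between penalising the per-sample rate of the ELBO (2) and penalising the aggregate divergence $D_{KL}(q_{\phi}(z)||p(z))$ of (5) can be seen on a toy one-dimensional case. The sketch assumes a Gaussian encoder $q(z|x)=\mathcal{N}(\mu_{x},\sigma^{2})$ whose means are distributed as $\mathcal{N}(0,1-\sigma^{2})$ over the data, so that the aggregate $q(z)$ is exactly $\mathcal{N}(0,1)$. The encoder, the value of `sigma2`, and the value of `lam` are illustrative assumptions, not settings taken from the experiments.

```python
import numpy as np

def kl_gauss_1d(m, v, m0=0.0, v0=1.0):
    """D_KL(N(m, v) || N(m0, v0)) for scalar Gaussians; v and v0 are variances."""
    return 0.5 * (v / v0 + (m - m0) ** 2 / v0 - 1.0 + np.log(v0 / v))

sigma2 = 0.01                                   # encoder variance (sharp, informative code)
rng = np.random.default_rng(0)
mu_x = rng.normal(0.0, np.sqrt(1.0 - sigma2), size=100_000)  # encoder means over the data

# ELBO regulariser: E_x D_KL(q(z|x) || p(z)), estimated over the sampled means.
elbo_rate = float(np.mean(kl_gauss_1d(mu_x, sigma2)))

# VIM regulariser: D_KL(q(z) || p(z)) with q(z) = N(0, (1 - sigma2) + sigma2).
aggregate = float(kl_gauss_1d(0.0, (1.0 - sigma2) + sigma2))

lam = 10.0                                      # hypothetical lambda
print(elbo_rate, lam * aggregate)
```

Here the rate is close to $-\log\sigma\approx 2.3$ nats while the aggregate penalty is zero: the encoder is informative yet matches the prior in aggregate, so only the ELBO regulariser pushes against it. This follows from the standard identity $\mathbb{E}_{p(x)}[D_{KL}(q_{\phi}(z|x)||p(z))]=I_{\phi}(X,Z)+D_{KL}(q_{\phi}(z)||p(z))$.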

The derived objective learns a maximally informative decoder, but from description (5) it is not clear whether the autoencoder learns a useful representation. To answer this question we have to consider the following equivalent description [13]:

$$VIM_{\theta,\phi}=-D_{KL}(p(x)||p_{\theta}(x))-D_{KL}(q_{\phi}(z|x)||p_{\theta}(z|x))-(\lambda-1)D_{KL}(q_{\phi}(z)||p(z))+I_{\phi}(X,Z) \tag{6}$$

Thanks to the dual definition (6), we see that the generator $p_{\theta}(x|z)$ is an actual generator, since its marginal is close to the visible distribution (first term); that the learned representation is maximally informative (fourth term), with maximal information bounded by the entropy $h_{\theta}(z)$ (third term); and finally that the generative model does not rely on the weight information (second term): indeed, if by contradiction $p_{\theta}(x|z)=p_{\theta}(x)$, we would have minimal encoding information, $I_{\phi}=0$.

III-B Channel capacity

In the ideal setting above, where the parameter families and the correct prior were known, we derived that the optimal solution is obtained by the generator whose mutual information coincides with the network capacity

$$C_{\theta}(X,Z)=\sup_{\theta,\,p(z)\in\mathcal{P}}I_{\theta}(X,Z) \tag{7}$$

Since the decoding information is bounded by the inference information, see (6), and the latter cannot grow beyond the entropy $h(Z)$, the optimal case that holds in a noiseless channel, we can assert that the VIM solution is the one optimising the following objective:

$$\max_{\theta}I_{\theta}(X,Z)\quad\text{s.t.}\quad C_{\theta}(X,Z)=h(Z) \tag{8}$$

Let us observe that the capacity constraint is fundamental to ensure that the information about the generated variable $X$ comes from the variable $Z$ and not from the weights $\theta$; indeed, the generative information $I_{\theta}(X;Z)$ can potentially grow up to $h(X)$, $p_{\theta}(x|z)=p_{\theta}(x)$, whereas the information about $Z$ is at most $h(Z)$. When the generative model relies on the weight information we will say that the model is overfitting [14].

From the description above, we see that the choice of the, generally unknown, prior $p(z)$ plays a determinant role in the learning performance. Indeed, the relationship between capacity and network entropy suggests that a network with high entropy is more prone to overfit: the inference network learns unnecessary properties of the data and the generative model relies on the weights information, $I(X,\theta)$. In order to test the deduction that a high capacity network is more prone to overfit, in the experiments (see below) we consider the cases where $Z$ is Normal (VIMAE-n) or Logistic (VIMAE-l) distributed. We choose to compare the popular Normal distribution with the Logistic one for two reasons: the Logistic has less entropy than a Gaussian distribution, and it is a common assumption in natural science that the hidden factors of the visible data are logistically distributed [18].
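The entropy comparison between the two priors can be checked with the standard closed forms: $\frac{1}{2}\log(2\pi e\sigma^{2})$ for a Normal with standard deviation $\sigma$, and $\log s+2$ for a Logistic with scale $s$ (variance $s^{2}\pi^{2}/3$). The sketch below evaluates both. How the Logistic prior is parametrised in the experiments is not specified in this section, so the matched-variance comparison is an assumption, and the unit-scale value is printed alongside it for reference.

```python
import numpy as np

def normal_entropy(sigma):
    """Differential entropy (nats) of N(0, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def logistic_entropy(scale):
    """Differential entropy (nats) of a Logistic distribution with the given scale."""
    return np.log(scale) + 2.0

h_normal = normal_entropy(1.0)
h_logistic_unit_var = logistic_entropy(np.sqrt(3) / np.pi)  # scale giving variance 1
h_logistic_unit_scale = logistic_entropy(1.0)               # scale 1, variance pi^2 / 3

print(h_normal, h_logistic_unit_var, h_logistic_unit_scale)
```

At equal variance the Logistic entropy is slightly below the Normal one, in line with the Gaussian being the maximum-entropy distribution for a fixed variance; at unit scale the Logistic has the larger variance and the larger entropy.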

IV Related work

Autoencoder literature

Autoencoder models are one of the most widely used families of neural networks for extracting features in an unsupervised way [19], and their relationship with Information Theory has been well established since the first unregularised autoencoders [20]. The classical unregularised autoencoders, which minimise the reconstruction loss $\mathbb{E}_{p(x)}[\mathbb{E}_{q_{\phi}(z|x)}[-\log p_{\theta}(x|z)]]$, maximise an unbounded information, i.e. they look for a solution in the space $\tilde{\mathcal{P}}_{\theta}=\{p_{\theta}:p_{\theta}(x)=p(x)\}$. In general, a solution in this wide space is good only for reconstruction performance, because $Z$ contains all the possible information that can be stored in the space $\mathcal{Z}$, and it is not robust to input noise [21]; but, as observed in [22], if $q_{\phi}(z|x)\sim\mathcal{N}(\mu(x),\sigma(x))$ the model $p_{\theta}(x|z)$ is robust to noise and is a Gaussian generator. In this context the uninformative representation issue is avoided, but the price to pay is that it is impossible to sample directly from a prior $p(z)$, which is not defined; indeed, the model described in [22] requires running a relatively expensive Markov chain to obtain samples.

Many regularised models have been proposed, but the best known is the VAE, which minimises the expected code length of communicating $x$. As we observed in the previous sections, the method is not guaranteed to find a useful representation. This issue can be addressed either by controlling the information of the model or by considering a more flexible prior $p(z)$ [23, 24, 3]. The latter approach, with a free inference model, obtains the best generative performance, but the inference model is difficult to interpret and to use for tasks different from the one for which it was trained [10]; for this reason, we do not consider it in this manuscript.

VAE alternatives

As observed above, a wrong choice of the prior $p(z)$ can be associated with a large encoding information and hence with an inference model that is not useful. For this reason the $\beta$-VAE was proposed in [15], a variational autoencoder optimising the following variant of the ELBO

$$\mathbb{E}_{q(z|x)}[-\log p_{\theta}(x|z)]-\beta\,\mathbb{E}_{x}[D_{KL}(q(z|x)||p(z))],\quad\beta>1 \tag{9}$$

In this way the mutual information $I_{\phi}(Z;X)$ is controlled, giving a better inference model. Unfortunately, this carries a high risk of an uninformative generative model, with both poor generative quality and poor reconstruction accuracy. Indeed, bounding the encoding information while maintaining a fixed high entropy prior means that the reconstruction term is minimised by relying on the weights information. Moreover, the choice of the parameter $\beta$ is tricky, since a higher $\beta$ corresponds to a higher probability of a non-informative representation, $D_{KL}(q(z|x)||p(z))$. For this reason, in order to have both good inference and good generative performance, it was suggested to optimise the following objective:

$$\mathbb{E}_{q(z|x)}[-\log p_{\theta}(x|z)]-\beta\,\big|C-\mathbb{E}_{x}[D_{KL}(q(z|x)||p(z))]\big| \tag{10}$$

In this way the learnt inference information $I_{\phi}(X,Z)$ is guaranteed to be lower than $C$. Let us observe that the objective (10) coincides with the VIM (5) in the case $C=h(Z)$ but, unlike VIM, this approach has two main issues, one theoretical and one computational. Theoretically, it solves only the uninformative representation issue, and the risk of a weight dependent generator remains high, for the same reason as in the $\beta$-VAE. Computationally, the learning principle (10) is often intractable.
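The following sketch writes the two $\beta$-VAE variants as losses to be minimised over a batch, with the penalty added to the reconstruction cost. That sign convention is an assumption following the usual $\beta$-VAE formulation and differs from the sign printed in (9) and (10); the function names and the toy numbers are likewise hypothetical.

```python
import numpy as np

def beta_vae_loss(neg_log_lik, rate, beta):
    """Reconstruction cost plus beta times the average rate (loss form of (9))."""
    return float(np.mean(neg_log_lik) + beta * np.mean(rate))

def capacity_vae_loss(neg_log_lik, rate, beta, capacity):
    """Reconstruction cost plus beta * |C - average rate| (loss form of (10))."""
    return float(np.mean(neg_log_lik) + beta * abs(capacity - np.mean(rate)))

# Toy per-sample values: -log p(x|z) and D_KL(q(z|x) || p(z)).
neg_log_lik = np.array([100.0, 120.0])
rate = np.array([4.0, 6.0])

print(beta_vae_loss(neg_log_lik, rate, beta=4.0))                    # penalises any rate
print(capacity_vae_loss(neg_log_lik, rate, beta=4.0, capacity=5.0))  # no penalty at rate = C
print(capacity_vae_loss(neg_log_lik, rate, beta=4.0, capacity=2.0))  # penalises the excess
```

The first form pushes the rate towards zero, while the second pulls it towards the target value $C$, which is how it bounds the encoding information.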

Starting from different research points of view, respectively minimal cost generation and maximally informative inference, the objective in (5) was first derived in [13] and [25]. The main difference between our manuscript and the cited works lies in the information theoretic analysis, and in particular in the description of the role of the network capacity and its relationship with the entropy of the latent prior. Finally, we underline that it is possible to consider in (5) distance measures other than the Kullback-Leibler divergence. For example, to use a Jensen-Shannon divergence in (5) it is necessary to introduce an adversarial network that discriminates the true samples $z\sim p(z)$ from the fake ones sampled from $q_{\phi}(z)$ [9]. In the latter case the resulting model is equivalent to the Adversarial AutoEncoder [26].

Information theoretic literature

Information theory is strongly related to neural networks, and not only to autoencoders. Originally the InfoMax objective was applied to a self-organised system with a single hidden layer [17, 27], where the bound on the capacity was given by the number of hidden neurons. More recently, the (naive) InfoMax has given way to a new information-theoretic principle: the Information Bottleneck [28]. The idea of this principle is that a feed-forward neural network trained for task $T$ tends to learn a minimal sufficient representation of the data, maximising the following objective:

$$\max_{Z}I(Z,T)-\beta I(X,Z) \tag{11}$$

Although it was shown that in the general case this principle does not hold true [29], the principle was used successfully as a regularisation technique in both unsupervised [12, 15] and supervised [30] settings. We observe that the CCIM, (8), and IB, (11), coincide in the case of a deterministic encoder, where the encoding information is the entropy of $Z$.

V Experiments

In this section we empirically evaluate both the generative and inference models learned by optimising VIM (5), we highlight the relationship between the network capacity and robust inference, and we compare the two VIMAE variants, VIMAE-n ($p(z)\sim\mathcal{N}(0,1)$) and VIMAE-l ($p(z)\sim Logistic(0,1)$), with the solution learned by ELBO (2) and its principal variants (9) and (10). In all the described experiments, the divergence $D_{KL}(q(z)||p(z))$ in (5) is approximated via the Maximum Mean Discrepancy [13], defined as:

$$MMD(q(z),p(z))=\sup_{f:\|f\|_{\mathcal{H}_{k}}\leq 1}\mathbb{E}_{p(z)}[f(Z)]-\mathbb{E}_{q(z)}[f(Z)]$$

where $\mathcal{H}_{k}$ is the Reproducing Kernel Hilbert Space associated with a positive definite kernel $k(\cdot,\cdot):\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}_{+}$, and $f$ is a map living in $\mathcal{H}_{k}$, i.e. $f:\mathcal{Z}\to\mathbb{R}$ such that $\langle f,k(x,\cdot)\rangle_{\mathcal{H}}=f(x)$.
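In practice the supremum over the unit ball of $\mathcal{H}_{k}$ has a closed form in terms of the kernel, and the squared MMD can be estimated from samples as $\mathbb{E}[k(z,z')]+\mathbb{E}[k(y,y')]-2\mathbb{E}[k(z,y)]$ with $z,z'\sim q$ and $y,y'\sim p$. The sketch below implements the biased version of this estimator. The RBF kernel, its bandwidth, and the function names are assumptions for illustration; this section does not state which kernel the experiments use.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth**2))

def mmd2(z_q, z_p, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples z_q ~ q(z) and z_p ~ p(z)."""
    k_qq = rbf_kernel(z_q, z_q, bandwidth).mean()
    k_pp = rbf_kernel(z_p, z_p, bandwidth).mean()
    k_qp = rbf_kernel(z_q, z_p, bandwidth).mean()
    return float(k_qq + k_pp - 2 * k_qp)

rng = np.random.default_rng(0)
z_prior = rng.normal(size=(500, 2))            # samples from p(z) = N(0, I)
z_matched = rng.normal(size=(500, 2))          # codes whose aggregate matches the prior
z_shifted = rng.normal(loc=3.0, size=(500, 2)) # codes whose aggregate is far from the prior

print(mmd2(z_matched, z_prior))   # small
print(mmd2(z_shifted, z_prior))   # clearly larger
```

Because the estimator needs only samples from $q_{\phi}(z)$ and $p(z)$, it can be used with any prior that can be sampled, including a Logistic one.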

Moreover, because of the difficulty of computing the objective (10), as suggested in [12] we decided to optimise a $\beta$-VAE with $\beta<1$, denoted $\beta_{A}$-VAE; in order to avoid any confusion, the original version proposed in [15] with $\beta\gg 1$ will be renamed $\beta_{H}$-VAE.

The experiments were performed with the same settings and autoencoder models used in [25], an architecture similar to the DCGAN [6] with batch normalisation [31] (more details are given in the Appendix). We consider four data-sets: MNIST, CIFAR10 and Omniglot, three standard data-sets with ground-truth labels, used to evaluate both the generative and inference models; and CelebA [32], a large, high-entropy dataset consisting of roughly 203k faces of $64\times 64$ resolution, used to evaluate the generative performance.

After considering many values for $\beta_{H}$, $\beta_{A}$ and $\lambda$, we chose, in accordance with what was suggested in [25], $\beta_{H}=\lambda=10$ and $\beta_{A}=0.2$ for the MNIST and Omniglot experiments, and $\beta_{H}=\lambda=100$ and $\beta_{A}=0.4$ for the CelebA and CIFAR10 experiments.

V-A Decoding information

In this section we estimate the informativeness of the learnt generative model $p_{\theta}(x|z)$, evaluating both the generated sample quality and the reconstruction accuracy of the associated model.

Given a representation variable $Z\sim p(Z)$, a model $p_{\theta}(\cdot|z)$ is said to be a good generator if the generated random variable $X_{g}\sim p_{\theta}(X)$ is close, with respect to a distance measure $D$, to the visible random variable $X_{v}\sim p(X)$. To evaluate the similarity between the generated samples and the visible data, we consider two classic metrics: the Negative Log-Likelihood (NLL) for the grey-scale pictures, and the Frechet Inception Distance (FID) for the RGB datasets. The reason why we consider two different measures is twofold: firstly, FID, an estimate of the Frechet distance $\|X_{g}-X_{v}\|_{2}^{2}$, does not work well in the grey-scale setting; secondly, we wish to highlight that a model with minimal NLL, or equivalently with minimal divergence $D_{KL}(p(x)||p_{\theta}(x))$, is often not the most informative one or the one with the sharpest samples.
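FID is commonly computed as the Frechet distance between two Gaussians fitted to Inception features of the generated and the visible images: $\|\mu_{g}-\mu_{v}\|_{2}^{2}+\mathrm{Tr}(\Sigma_{g}+\Sigma_{v}-2(\Sigma_{g}\Sigma_{v})^{1/2})$. The sketch below implements that formula for feature arrays that are assumed to be already extracted; random arrays stand in for the Inception features, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feat_g, feat_v):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_g, mu_v = feat_g.mean(axis=0), feat_v.mean(axis=0)
    cov_g = np.cov(feat_g, rowvar=False)
    cov_v = np.cov(feat_v, rowvar=False)
    # Tr((cov_g cov_v)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_g @ cov_v, which are real and non-negative in theory.
    eig = np.linalg.eigvals(cov_g @ cov_v)
    tr_sqrt = np.sum(np.sqrt(np.clip(eig.real, 0.0, None)))
    mean_term = np.sum((mu_g - mu_v) ** 2)
    return float(mean_term + np.trace(cov_g) + np.trace(cov_v) - 2 * tr_sqrt)

rng = np.random.default_rng(0)
feat_v = rng.normal(size=(2000, 3))   # stand-in for features of visible images
feat_g = feat_v + 2.0                 # stand-in for features of generated images

print(frechet_distance(feat_v, feat_v))  # approximately 0
print(frechet_distance(feat_g, feat_v))  # approximately 12: mean shift of 2 in each of 3 dimensions
```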

Let us start with the experiments on the grey-scale datasets (MNIST and Omniglot). Although the NLL associated with each model is similar, see table I, we observe from figures 1 and 2 that the quality of the generated samples differs. Indeed, the samples generated by the ELBO models are blurry: the information between the weights and the generated samples is high, so the generated data $\{x_{i}\}_{i}$ are close to their average value $\bar{x}=\mathbb{E}[p_{\theta}(x)]$. This is consistent with the ELBO objective; indeed $D_{KL}(p(x)||p_{\theta}(\bar{x}))=0$. This phenomenon appears particularly clearly in the MNIST setting where, owing to the simplicity (small entropy) of the dataset, a model like $\beta_{H}$-VAE, which bounds the information between $X$ and $Z$, obtains optimal NLL performance.

The experiments in the challenging RGB setting are more explanatory. Here the VIM models have the best generative performance, as we can see from figures 3 and 4 and from the FID scores in table I. Moreover, we observe that in this context $\beta_{H}$-VAE has poor results; for example, on CelebA we were not able to train a model with $\beta\gg 1$.

In order to confirm that the VIMAE generators are the most informative, in table II we compare the reconstruction losses, a rough estimate of the decoding information, see (5). In accordance with the description above, we see that, apart from Omniglot, where all the models perform similarly, the VIMAE models have the best reconstruction performance in all the settings. In particular, in this task VIMAE-n performs better than VIMAE-l, in accordance with the idea that a small prior entropy is associated with a small capacity, and hence with a less informative model.

Finally, we underline that the $\beta_{H}$-VAE, although theoretically similar to the VIMAE, behaves differently; in particular, it performs worse than the classical VAE. This phenomenon is in agreement with the idea that a bigger capacity network naturally tends to overfit.

V-B Encoding information

A good inference model is one that learns a representation in which the generative factors are separated out (disentangled) and which is robust to noise. In order to evaluate these properties, following the approach proposed in [1], we evaluate the accuracy of a supervised model trained directly on the feature space $\mathcal{Z}$. In particular, to evaluate the disentanglement property we consider the semi-supervised procedure used in [13]: we train the M1+TSVM [33] on the features learnt by the autoencoder and use the classification accuracy over 1000 (100 for Omniglot) samples as an approximate metric of how relevant the learnt representations are for a classification task. In order to evaluate the robustness of the learned features, we ran the same algorithm on the representations associated with corrupted data, i.e. $z\sim q(z|x+\nu)$, considering two types of noise: Gaussian and mask. In the Gaussian case, we add to each pixel a value $\nu$ sampled from $\mathcal{N}(0,\sigma^{2})$ with $\sigma\in\{0.2,0.3,0.4\}$; in the masking case a fraction $\nu$ of the elements is forced to be 0: each pixel is masked according to a Bernoulli distribution $\mathcal{B}(p),p\in\{0.2,0.5\}$. Higher classification performance suggests that the learned representation contains the relevant information and, in the case of corrupted input data, that it is robust. In the Omniglot case, given the difficulty of the task (the test alphabet was never seen during training), we consider a 5-character data-set, split into 300 ($60\times 5$) samples for training and 100 for evaluation.
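The two corruption schemes described above can be sketched as follows. The function names, the image shape, and the absence of any clipping back to the pixel range are assumptions; the text does not say whether corrupted pixels are clipped.

```python
import numpy as np

def gaussian_corrupt(x, sigma, rng):
    """Add i.i.d. noise nu ~ N(0, sigma^2) to every pixel."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def mask_corrupt(x, p, rng):
    """Set each pixel to 0 independently with probability p (Bernoulli mask)."""
    keep = rng.random(x.shape) >= p
    return x * keep

rng = np.random.default_rng(0)
x = np.ones((200, 200))                      # hypothetical all-white image

x_gauss = gaussian_corrupt(x, sigma=0.3, rng=rng)
x_mask = mask_corrupt(x, p=0.2, rng=rng)

print(x_gauss.std())          # close to sigma
print(1.0 - x_mask.mean())    # fraction of masked pixels, close to p
```

The corrupted inputs are then passed through the encoder, and the classifier is evaluated on the resulting codes $z\sim q(z|x+\nu)$.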

From the classification scores listed in tables III-V, we see that the ELBO-based models learn good representations for clean data, but not when corrupted data is given as input. This is particularly clear in the Bernoulli corruption case, a type of noise different from the one seen during training. Particularly relevant is the behaviour of the two VIMAEs: they are comparable on clean data and with small noise, but the one with large capacity, VIMAE-n, suffers in the large noise setting, while the one with small capacity, VIMAE-l, is the most robust, and in some challenging cases, see table III, the noise helps to improve the model accuracy. This result is consistent with the idea that a small capacity network learns the relevant factors of the input data, which are the only ones robust to input noise.

VI Conclusion

We observed, via an information theoretic description of the VAE, that it is possible to learn a good generative model while maintaining a meaningful hidden representation, and that this goal can be reached by optimising the CCIM, an objective that separates out the two properties of a network: the generative information and its capacity. We underlined the relationship between robustness and network capacity, and how the latter can be defined by the prior $p(z)$.

The definition of the network capacity, and its strict relationship with the choice of the latent prior, suggests that the VIMAE could be used in tasks where it is necessary to modify the network capacity continually, for example in the life-long learning case, where the choice of the network capacity is fundamental in order to avoid the catastrophic forgetting issue [10]. In light of the good performance of the CCIM objective, and of its relationship with the Information Bottleneck, future work includes the generalisation of the CCIM to the supervised case, where the Information Bottleneck is considered the best option [30].

Bibliography

[1] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, “Higher order contractive auto-encoder,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 645–660.
[2] Z. C. Lipton, “The mythos of model interpretability,” Queue, vol. 16, no. 3, pp. 31–57, 2018.
[3] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” arXiv preprint arXiv:1605.08803, 2016.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[5] C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Teh, “Filtering variational objectives,” in Advances in Neural Information Processing Systems, 2017, pp. 6573–6583.
[6] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[7] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[8] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
