Disentangling Latent Space for VAE by Label Relevant/Irrelevant Dimensions
Zhilin Zheng, Li Sun

TL;DR
This paper proposes a novel VAE method that disentangles latent space into label relevant and irrelevant parts, improving class-specific representation and avoiding posterior collapse.
Contribution
It introduces a disentangled latent space with class-specific Gaussian mixture distribution and demonstrates theoretical equivalence to mutual information maximization.
Findings
Disentangled latent space improves class-specific representation.
The method is extendable to GANs for high-quality image synthesis.
Theoretical analysis shows equivalence to KL divergence on joint distribution.
| | FaceScrub | CIFAR-10 |
|---|---|---|
| cVAE-GAN [3] | 0.0141 | 0.0136 |
| ours | 0.0157 | 0.0149 |
| Encoder | Decoder | Discriminator |
|---|---|---|
| input | input | input |
| conv, 32, stride 2, batchnorm, relu | concat | conv, 32, stride 1, lrelu |
| conv, 64, stride 2, batchnorm, relu | fc, 1024, batchnorm, relu | conv, 128, stride 2, lrelu |
| conv, 128, stride 2, batchnorm, relu | conv, 256, stride 2, batchnorm, relu | conv, 256, stride 2, lrelu |
| conv, 256, stride 2, batchnorm, relu | conv, 256, stride 1, batchnorm, relu | conv, 256, stride 2, lrelu |
| fc, 1024, batchnorm, relu | conv, 128, stride 2, batchnorm, relu | fc, 512, lrelu |
| fc, 100 (for ) / 200 (for ) | conv, 64, stride 2, batchnorm, relu | fc, 1 |
| | conv, 32, stride 2, batchnorm, relu | |
| | conv, 3, stride 1, tanh | |
| Discriminator for FaceScrub |
|---|
| input |
| conv, 64, stride 2, lrelu |
| conv, 128, stride 2, lrelu |
| conv, 256, stride 1, lrelu |
| conv, 256, stride 2, lrelu |
| conv, 512, stride 1, lrelu |
| conv, 512, stride 2, lrelu |
| conv, 512, stride 2, lrelu |
| global average pooling |
| fc, 1024, lrelu |
| fc, 1 |
| | CUB-200-2011 | Cifar-100 |
|---|---|---|
| cVAE-GAN [3] | 0.0195 | 0.0179 |
| ours | 0.0192 | 0.0190 |
Zhilin Zheng1 Li Sun1
1 Shanghai Key Laboratory of Multidimensional Information Processing,
East China Normal University
Abstract
A VAE requires the standard Gaussian distribution as the prior in the latent space. Since all codes tend to follow the same prior, it often suffers from the so-called "posterior collapse". To avoid this, this paper introduces a class-specific distribution for the latent code. Unlike cVAE, however, we present a method for disentangling the latent space into label relevant and label irrelevant dimensions, $z_s$ and $z_u$, for a single input. We apply two separate encoders to map the input into $z_s$ and $z_u$ respectively, and then give the concatenated code to the decoder to reconstruct the input. The label irrelevant code $z_u$ represents the common characteristics of all inputs; hence it is constrained by the standard Gaussian, and its encoder is trained by amortized variational inference, as in VAE. In contrast, $z_s$ is assumed to follow a Gaussian mixture distribution in which each component corresponds to a particular class. The parameters of the Gaussian components in the $z_s$ encoder are optimized by label supervision in a global stochastic way. In theory, we show that our method is equivalent to adding a KL divergence term on the joint distribution of $z_s$ and the class label $y$, and that it directly increases the mutual information between $z_s$ and the label $y$. Our model can also be extended to a GAN by adding a discriminator in the pixel domain so that it produces high-quality and diverse images.
1 Introduction
Learning a deep generative model for structured image data is difficult because the task is not simply modeling a many-to-one mapping function, as in classification; instead, it is often required to generate diverse outputs for similar codes sampled from a simple distribution. Furthermore, an image $x$ in the high-dimensional space often lies on a complex manifold, so the generative model should capture the underlying data distribution $p(x)$.
Basically, the Variational Auto-Encoder (VAE) [34, 20] and the Generative Adversarial Network (GAN) [13, 25] are two strategies for structured data generation. In VAE, the encoder $q_\phi(z|x)$ maps data $x$ into the code $z$ in latent space. The decoder, represented by $p_\theta(x|z)$, is given a latent code $z$ sampled from the distribution specified by the encoder and tries to reconstruct $x$. The encoder and decoder in VAE are trained together, mainly based on the data reconstruction loss. At the same time, the distribution $q_\phi(z|x)$ is regularized to be simple (e.g. Gaussian) through the Kullback-Leibler (KL) divergence between $q_\phi(z|x)$ and $p(z)$, so that sampling in latent space is easy. Optimization of VAE is quite stable, but its results are blurry, mainly because the posterior defined by $q_\phi(z|x)$ is not complex enough to capture the true posterior, a problem also known as "posterior collapse". On the other hand, GAN treats the data generation task as a min/max game between a generator $G$ and a discriminator $D$. The adversarial loss computed from the discriminator makes the generated images more realistic, but training becomes more unstable. In [10, 22, 28], VAE and GAN are integrated so that they can benefit from each other.
Both VAE and GAN work in an unsupervised way, without conditioning the generated image on a label. Conditional VAE (cVAE) [39, 3] extends VAE by providing the label $y$ to both the encoder and the decoder. It learns the data distribution conditioned on the given label. Hence, the encoder and decoder become $q_\phi(z|x, y)$ and $p_\theta(x|z, y)$. Similarly, in conditional GAN (cGAN) [9, 18, 33, 30] the label $y$ is given to both the generator $G$ and the discriminator $D$. Theoretically, feeding the label $y$ to either the encoder in VAE or the decoder in VAE or GAN helps increase the mutual information between the generated $x$ and the label $y$. Thus, it can improve the quality of the generated image.
This paper deals with the image generation problem in VAE with two separate encoders. For a single input $x$, our goal is to disentangle the latent space code $z$, computed by the encoders, into the label relevant dimensions $z_s$ and the irrelevant ones $z_u$. We emphasize the difference between $z_s$ and $z_u$, and between their corresponding encoders. For $z_s$, since the label $y$ is known during training, the code should be more accurate and specific, while $z_u$, without any label constraint, should be general. Specifically, the two encoders are constrained with different priors on their posterior distributions $q_\psi(z_s|x)$ and $q_\phi(z_u|x)$. As in VAE or cVAE, in which the full code $z$ is label irrelevant, the prior for $z_u$ is chosen as $\mathcal{N}(0, I)$. Unlike previous works, however, the prior $p(z_s)$ becomes complex in order to capture the label relevant distribution. From the decoder's perspective, it takes the concatenation of $z_s$ and $z_u$ to reconstruct the input $x$. The distinction from cVAE and cGAN is that they use a fixed, one-hot encoded label, while our work uses $z_s$, which can be regarded as a variational, soft label.
Note that there are two stages for training our model. First, the encoder for $z_s$ is trained for the classification task under the supervision of the label $y$. Instead of the softmax cross entropy loss, the Gaussian mixture cross entropy loss proposed in [44] is adopted, since it accumulates the mean $\mu_y$ and variance $\Sigma_y$ of samples with the same label $y$ and models them as the Gaussian $\mathcal{N}(\mu_y, \Sigma_y)$; hence $z_s \sim \mathcal{N}(\mu_y, \Sigma_y)$. The first stage specifies the label relevant distribution. In the second stage, the two encoders and the decoder are trained jointly in an end-to-end manner based on the reconstruction loss, while the priors of $z_s$ and $z_u$ are also taken into account.
The main contributions of this paper are as follows: (1) For a single input to the encoder, we provide an algorithm to disentangle the latent space into label relevant and irrelevant dimensions in VAE. Previous works such as [15, 4, 37] disentangle the latent space in AE rather than VAE, so variational inference cannot be performed with their models. Moreover, [27, 4, 23] require at least two inputs for training. (2) We find that the Gaussian mixture loss function is a suitable way of estimating the parameters of the prior distribution, and that it can be optimized in the VAE framework. (3) We give both a theoretical derivation and a variety of detailed experiments to explain the effectiveness of our work.
2 Related works
Two types of methods for structured image generation are VAE and GAN. VAE [20] is a type of parametric model defined by $p_\theta(x|z)$ and $q_\phi(z|x)$, which employs the idea of variational inference to maximize the evidence lower bound (ELBO), as shown in (1).
$$\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \tag{1}$$
The right-hand side of the above is the ELBO, which is a lower bound of the log-likelihood. In VAE, a differentiable encoder and decoder are connected, parameterized by $\phi$ and $\theta$, respectively. $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ represents the end-to-end reconstruction loss, and $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ is the KL divergence between the encoder's output distribution $q_\phi(z|x)$ and the prior $p(z)$, which is usually modeled by the standard normal distribution $\mathcal{N}(0, I)$. Note that VAE assumes that the posterior $q_\phi(z|x)$ is Gaussian, and its $\mu$ and $\Sigma$ are estimated for every single input $x$ by the encoder. This strategy is named amortized variational inference (AVI), and it is more efficient than stochastic variational inference (SVI) [17].
VAE's advantage is that its loss is easy to optimize, but the simple prior in latent space may not capture complex data patterns, which often leads to mode collapse in latent space. Moreover, VAE's code is hard to interpret. Thus, many works focus on improving VAE in these two aspects. cVAE [39] adds the label vector as an input to both the encoder and decoder, so that the latent code and generated image are conditioned on the label, which potentially prevents latent collapse. On the other hand, $\beta$-VAE [16, 7] is an unsupervised approach to latent space disentanglement. It introduces a simple hyper-parameter $\beta$ to balance the two loss terms in (1). A scheme named infinite mixture of VAEs is proposed and applied to semi-supervised generation in [1]. It uses multiple VAEs and combines them as a non-parametric mixture model. In [19], the semi-amortized VAE is proposed, which combines AVI with SVI in VAE. Here SVI estimates the distribution parameters on the whole training set, while the AVI in traditional VAE gives this estimation for a single input.
GAN [13] is another technique to model the data distribution $p(x)$. It starts from a random $z \sim p(z)$, where $p(z)$ is simple, e.g. Gaussian, and trains a transform network $G$ with the help of a discriminator $D$ so that the distribution of $G(z)$ approximates $p(x)$. Later works [32, 26, 2, 14, 29] try to stabilize GAN's training. Traditional GAN works in a fully unsupervised manner, while cGAN [18, 33, 30, 6] aims to generate images conditioned on labels. In cGAN, the label is given as an input to both the generator and discriminator as a condition for the distribution. An encoder-decoder architecture such as AE or VAE can also be used in GAN. In ALI [11] and BiGAN [10], the encoder maps $x$ to $z$, while the decoder reverses it. The discriminator takes the pair of $x$ and $z$, and is trained to determine whether it comes from the encoder or decoder in an adversarial manner. In VAE-GAN [22, 24], VAE's generated data are improved by a discriminator. A similar idea is also applied to cVAE in [3]. VAE-GAN has also been used in some specific applications such as [4, 12].
Since the code $z$ potentially affects the generated data, some works try to model its effect and disentangle the dimensions of $z$. InfoGAN [9] reveals the effect of the latent space code $c$ by maximizing the mutual information between $c$ and the synthetic data $G(z, c)$. Its generator outputs $G(z, c)$, which is inspected by the discriminator $D$. $D$ also tries to reconstruct the code $c$. In [27], the latent dimensions are disentangled in VAE into specified and unspecified factors, which is similar to our work. But its encoder takes multiple inputs, and the decoder combines codes from different inputs for reconstruction. The work in [15] modifies [27] to take a single input. To stabilize training, its model is built on AE rather than VAE, hence it cannot perform variational inference. Other works in [37, 4, 23] are also built on AE and require at least two inputs. Moreover, they apply only to a particular domain such as faces [37, 4] or image-to-image translation [23], while our work is built on VAE and takes only a single input for a more general case.
3 Proposed method
We propose an image generation algorithm based on VAE which divides the encoder into two separate ones, one encoding the label relevant representation $z_s$ and the other encoding the label irrelevant information $z_u$. $z_s$ is learned with supervision from the categorical class label and is required to follow a Gaussian mixture distribution, while $z_u$ is intended to contain other common information irrelevant to the label and is made close to the standard Gaussian $\mathcal{N}(0, I)$.
3.1 Problem formulation
Given a labeled dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th image and $y_i$ is the corresponding label, $C$ and $N$ are the number of classes and the size of the dataset, respectively. The goal of VAE is to maximize the ELBO defined in (1), so that the data log-likelihood $\log p(x)$ is also maximized. The key idea is to split the full latent code $z$ into the label relevant dimensions $z_s$ and the irrelevant dimensions $z_u$, which means that $z_s$ fully reflects the class $y$ but $z_u$ does not. Thus the objective can be rewritten as follows (derived in detail in the Appendices).
$$\log p(x) \ge \mathbb{E}_{q_\psi(z_s|x),\,q_\phi(z_u|x)}\left[\log p_\theta(x|z_s, z_u)\right] - D_{KL}\left(q_\phi(z_u|x)\,\|\,p(z_u)\right) - D_{KL}\left(q_\psi(z_s, y|x)\,\|\,p(z_s, y)\right) \tag{2}$$
In Eq. 2, the ELBO becomes three terms in our setting. The first term is the negative reconstruction error, where $p_\theta(x|z_s, z_u)$ is the decoder parameterized by $\theta$. It measures whether the latent codes $z_s$ and $z_u$ are informative enough to recover the original data. In practice, the reconstruction error $L_{rec}$ can be defined as the loss between the input $x$ and its reconstruction. The second term acts as a regularization term on the label irrelevant branch that pushes $q_\phi(z_u|x)$ to match the prior distribution $p(z_u)$, as described in detail in Section 3.2. The third term matches $q_\psi(z_s|x)$ to a class-specific Gaussian distribution whose mean and covariance are learned with supervision, as further described in Section 3.3.
3.2 Label irrelevant branch
Intuitively, we want to disentangle the latent code $z$ into $z_s$ and $z_u$, and expect $z_u$ to follow a fixed prior distribution that is irrelevant to the label. This regularization is realized by minimizing the KL divergence between $q_\phi(z_u|x)$ and the prior $p(z_u)$, as given in Eq. 3. More specifically, $q_\phi(z_u|x)$ is a Gaussian distribution whose mean $\mu$ and diagonal covariance $\Sigma$ are output by the label irrelevant encoder parameterized by $\phi$. $p(z_u)$ is simply set to $\mathcal{N}(0, I)$. Hence the KL regularization term is:
$$L_{KL} = D_{KL}\left(q_\phi(z_u|x)\,\|\,p(z_u)\right) \tag{3}$$
Note that Eq. 3 has a closed form, which is easy to compute.
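The closed form mentioned above is the standard KL divergence between a diagonal Gaussian and $\mathcal{N}(0, I)$, namely $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. The following is a minimal sketch for a single sample; the function name and the log-variance parameterization are assumptions made for illustration, not details taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent code.

    mu and log_var are equal-length lists; log_var holds log(sigma^2)
    per dimension (an assumed parameterization).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

# The divergence vanishes when the posterior already equals the prior.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

The value grows as the mean moves away from zero or the variance departs from one, which is what pushes the label irrelevant code towards the prior.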
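To ensure good disentanglement between $z_s$ and $z_u$, we introduce adversarial learning in the latent space, as in AAE [25], to drive the label relevant information out of $z_u$. To do this, an adversarial classifier is added on top of $z_u$, which is trained to classify the category of $z_u$ with the cross entropy loss shown in (4):
$$L_{adv}^{C} = -\mathbb{E}_{q_\phi(z_u|x)} \sum_{c=1}^{C} \mathbb{I}(c = y)\,\log q_\omega(c|z_u) \tag{4}$$
where $\mathbb{I}(\cdot)$ is the indicator function, and $q_\omega(c|z_u)$ is the softmax probability output by the adversarial classifier parameterized by $\omega$. Meanwhile, the label irrelevant encoder is trained to fool the classifier, hence the target distribution becomes uniform over all categories, which is $\frac{1}{C}$. The cross entropy loss for the encoder is defined as (5).
$$L_{adv}^{E} = -\mathbb{E}_{q_\phi(z_u|x)} \sum_{c=1}^{C} \frac{1}{C}\,\log q_\omega(c|z_u) \tag{5}$$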
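The two cross entropy objectives of this section differ only in their target: the classifier is trained against the true label, while the encoder is trained against a uniform target. A minimal per-sample sketch follows; the function names are hypothetical, and the classifier is represented only by its output logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_loss(logits, label):
    """Cross entropy against the true label (objective of the adversarial classifier)."""
    return -math.log(softmax(logits)[label])

def encoder_fooling_loss(logits):
    """Cross entropy against a uniform target (objective of the label irrelevant encoder)."""
    probs = softmax(logits)
    return -sum(math.log(p) for p in probs) / len(probs)

# The fooling loss is smallest when the classifier output is uniform.
print(encoder_fooling_loss([0.0, 0.0, 0.0]), encoder_fooling_loss([5.0, 0.0, 0.0]))
```

With uniform classifier outputs the fooling loss reaches its minimum of $\log C$, so the encoder is rewarded when the classifier cannot tell classes apart from the label irrelevant code.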
3.3 Label relevant branch
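Inspired by the GM loss [44], we expect $z_s$ to follow a Gaussian mixture distribution, expressed in Eq. 6, where $\mu_c$ and $\Sigma_c$ are the mean and covariance of the Gaussian distribution for class $c$, and $p(c)$ is the prior probability, which is simply set to $\frac{1}{C}$ for all categories. For simplicity, we ignore the correlation among different dimensions of $z_s$, hence $\Sigma_c$ is assumed to be diagonal.
$$p(z_s) = \sum_{c=1}^{C} p(z_s|c)\,p(c) = \sum_{c=1}^{C} \mathcal{N}(z_s; \mu_c, \Sigma_c)\,p(c) \tag{6}$$
Recall that in Eq. 2, the KL divergence between $q_\psi(z_s, y|x)$ and $p(z_s, y)$ is minimized. If $q_\psi(z_s|x)$ is formulated as a Gaussian distribution with $\Sigma \to 0$ and with mean $\hat{z}_s$ output by the label relevant encoder, which is in fact a Dirac delta function $\delta(z_s - \hat{z}_s)$, the KL divergence reduces to the likelihood regularization term $L_{lkd}$ in Eq. 7, as proved in the Appendices. Here $\mu_y$ and $\Sigma_y$ are the mean and covariance specified by the label $y$.
$$L_{lkd} = -\log \mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y) \tag{7}$$
Furthermore, we want $z_s$ to contain as much label information as possible, so the mutual information between $z_s$ and the class $y$ is added to the maximization objective. We prove in the Appendices that this is equivalent to minimizing the cross-entropy loss between the posterior probability of the class given $z_s$ and the label, which is exactly the classification loss $L_{cls}$ in the GM loss, as shown in Eq. 8.
$$L_{cls} = -\log \frac{\mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y)\,p(y)}{\sum_{c=1}^{C} \mathcal{N}(\hat{z}_s; \mu_c, \Sigma_c)\,p(c)} \tag{8}$$
These two terms are added up to form the GM loss in Eq. 9. $L_{GM}$ is finally used to train the label relevant encoder.
$$L_{GM} = L_{cls} + L_{lkd} \tag{9}$$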
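The two parts of the GM loss can be sketched for one encoded sample as below, assuming diagonal covariances and equal class priors as stated in this section. The function names and the `lkd_weight` argument are hypothetical; the weight is included only because [44] balances the two terms, and it defaults to a plain sum.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log density of N(mean, diag(var)) evaluated at z."""
    return -0.5 * sum(
        (zi - mi) ** 2 / vi + math.log(vi) + math.log(2.0 * math.pi)
        for zi, mi, vi in zip(z, mean, var)
    )

def gm_loss(z_s, label, means, variances, lkd_weight=1.0):
    """Return (classification term, likelihood term, total) for one code z_s.

    means[c] and variances[c] are the per-class mean and diagonal variance.
    """
    log_probs = [log_gaussian_diag(z_s, m, v) for m, v in zip(means, variances)]
    # Log-sum-exp over classes; the equal class priors cancel in the posterior.
    top = max(log_probs)
    log_norm = top + math.log(sum(math.exp(lp - top) for lp in log_probs))
    l_cls = -(log_probs[label] - log_norm)  # cross entropy of the class posterior
    l_lkd = -log_probs[label]               # negative log-likelihood under the true class
    return l_cls, l_lkd, l_cls + lkd_weight * l_lkd

means = [[0.0, 0.0], [4.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(gm_loss([0.0, 0.0], 0, means, variances))
```

The classification term only asks that the correct component be the most likely one, while the likelihood term additionally pulls the code towards the centre of its own class Gaussian.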
3.4 The decoder and the adversarial discriminator
The latent codes $z_s$ and $z_u$ output by the two encoders are first concatenated, and then given to the decoder to reconstruct the input $x$. Here the decoder is denoted by $p_\theta(x|z_s, z_u)$, with its parameter $\theta$ learned from the reconstruction error $L_{rec}$. To synthesize a high-quality image, we also employ adversarial training in the pixel domain. Specifically, a discriminator $D$, whose parameters are trained adversarially, is used to improve the decoder output. Here the label $y$ is utilized in $D$ as in [30]. The adversarial training loss for the discriminator can be formulated as in Eq. 10,
[TABLE]
while this loss becomes
[TABLE]
for the generator. Note that here $G$ is the decoder and $p(z_s)$ is defined in Eq. 6.
3.5 Training algorithm
The training details are given in Algorithm 1. The label relevant encoder, modeled by $q_\psi(z_s|x)$, extracts the label relevant code $z_s$. It is trained with $L_{GM}$ and $L_{rec}$, encouraging $z_s$ to be label dependent and to follow a learned Gaussian mixture distribution. Meanwhile, the label irrelevant encoder, represented by $q_\phi(z_u|x)$, is intended to extract the class irrelevant code $z_u$. It is trained with $L_{KL}$, $L_{adv}^{E}$ and $L_{rec}$ to make $z_u$ irrelevant to the label and close to $\mathcal{N}(0, I)$. The adversarial classifier parameterized by $\omega$ is learned to classify $z_u$ using $L_{adv}^{C}$. Then the decoder generates the reconstructed image from the combined feature of $z_s$ and $z_u$ with the loss $L_{rec}$.
In the training process, a two-stage alternating training algorithm is adopted. First, the label relevant encoder is updated using $L_{GM}$ to learn the mean and covariance of the class-specific prior. Then, the two encoders and the decoder are trained jointly to reconstruct images while the distributions of $z_s$ and $z_u$ are taken into account.
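The alternation between the two stages can be sketched as a plain loop. This is only a scheduling skeleton under stated assumptions: `stage1_step` and `stage2_step` are hypothetical callbacks standing in for the actual parameter updates, and the default of three stage-one updates per stage-two update follows the ratio reported in Appendix B.4.

```python
def train(batches, stage1_step, stage2_step, stage1_repeats=3):
    """Alternate the two training stages over a sequence of batches.

    stage1_step(batch): update the label relevant encoder with the GM loss.
    stage2_step(batch): update both encoders and the decoder jointly.
    Returns whatever the callbacks return, in execution order.
    """
    history = []
    for batch in batches:
        for _ in range(stage1_repeats):
            history.append(stage1_step(batch))
        history.append(stage2_step(batch))
    return history

# Usage with stub callbacks that only record which stage ran on which batch.
log = train([0, 1], lambda b: ("stage1", b), lambda b: ("stage2", b))
print(log)
```

In a real implementation each callback would compute its losses on the batch and apply an optimizer step to the corresponding parameters.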
3.6 Application in semi-supervised generation
Given unlabeled extra data, we now use our architecture for semi-supervised generation, in which the labels $y$ of the unlabeled inputs $x$ are not available. We assume that the unlabeled data are in the same domain as the fully supervised dataset, but that the absent label $y$ may either lie within the predefined range of classes or fall outside it. In other words, if the absent $y$ is in the predefined range, its $z_s$ follows the same Gaussian mixture distribution as in Eq. 6. Otherwise, $z_s$ should follow an ambiguous Gaussian distribution defined in Eq. 11.
[TABLE]
More specifically, is expected to follow where and are the total mean and covariance of all the class-specific Gaussian distributions as illustrated in Eq. 6. Here, is diagonal matrix with as its variance vector. is also the variance vector of . Hence, the likelihood regularization term becomes . The whole network is trained in a end-to-end manner using total losses. Note that in this setting, the label is not provided, so , and are ignored in the training process.
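One possible reading of the "total mean and covariance" of the class-specific Gaussians is moment matching of the equally weighted mixture in Eq. 6 with a single diagonal Gaussian. The sketch below implements that reading only; it is an assumption and may differ from the exact definition in Eq. 11. The function name is hypothetical.

```python
def mixture_moments(means, variances):
    """Mean and diagonal variance of an equally weighted Gaussian mixture.

    means[c] and variances[c] are per-class lists of equal length.
    Uses Var[z] = E[var_c + mean_c^2] - (E[mean_c])^2 per dimension.
    """
    num_classes = len(means)
    dims = len(means[0])
    total_mean = [sum(m[j] for m in means) / num_classes for j in range(dims)]
    total_var = [
        sum(v[j] + m[j] ** 2 for m, v in zip(means, variances)) / num_classes
        - total_mean[j] ** 2
        for j in range(dims)
    ]
    return total_mean, total_var

# Two unit-variance classes centred at -1 and +1 in one dimension.
print(mixture_moments([[-1.0], [1.0]], [[1.0], [1.0]]))
```

Under this reading, the resulting Gaussian covers all class components, which matches the idea of an "ambiguous" distribution for inputs whose class is unknown.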
4 Experiments
In this section, experiments are carried out to validate the effectiveness of the proposed method. A toy example is first designed to show that, by disentangling the label relevant and irrelevant codes, our model is able to generate more diverse data samples than cVAE-GAN [3]. We then compare the quality of generated images on real image datasets. The latent space is also analyzed. Finally, the experiments on semi-supervised generation and image inpainting show the flexibility of our model, which suggests that it has many potential applications.
4.1 Toy examples
This section demonstrates our method on a toy example, in which the real data distribution lies in 2D with one dimension ($x$ axis) being label relevant and the other ($y$ axis) being irrelevant. The distribution is assumed to be known. There are 3 types of data points indicated by green, red and blue, belonging to 3 classes. The 2D data points and their corresponding labels are given to our model for variational inference and new sample generation.
For comparison, we also give the same training data to cVAE-GAN for the same purpose. The two compared models share similar network settings. In our model, the two encoders are both MLPs with 3 hidden layers of 32, 64, and 64 units. cVAE-GAN uses the same encoder architecture but has only one encoder. The discriminators are exactly the same, each an MLP with 3 hidden layers of 32, 64, and 64 units. Adam is used as the optimizer with a fixed learning rate of 0.0005 for both models. Each model is trained for 50 epochs, by which point both have converged. The generated samples of each model are plotted in Figure 2.
From Figure 2 we can observe that both models capture the underlying data distribution, and our model converges at a similar rate. The advantage of our model is that it tends to generate diverse samples, while cVAE-GAN generates samples conservatively, with the label irrelevant dimension confined to a limited value range.
4.2 Analysis on generated image quality
In this section, we compare our method with other generative models in terms of image generation quality. The experiments are conducted on two datasets: FaceScrub [31] and CIFAR-10 [21]. FaceScrub contains training images from different identities. For FaceScrub, a cascaded object detector proposed in [42] is first used to detect faces, and face alignment is then conducted based on SDM proposed in [46]. The detected and cropped faces are resized to a fixed size of 64×64. In the training process, Adam optimizer with is used. The hyper parameter , , and are set to 0.1, , and , respectively. Here, is the number of image pixels, and is the dimension of . Since our method incorporates the label for training, popular label-conditioned generative networks such as cVAE [39], cVAE-GAN [3], and cGAN [30] are chosen for comparison. For cVAE, cVAE-GAN and cGAN, we randomly generate samples of class $c$ by first sampling $z \sim \mathcal{N}(0, I)$ and then concatenating $z$ with the one-hot vector of $c$ as the input of the decoder/generator. For ours, $z_s \sim \mathcal{N}(\mu_c, \Sigma_c)$ and $z_u \sim \mathcal{N}(0, I)$ are sampled and combined for the decoder to generate samples. Some of the generated images are visualized in Figure 9. It shows that samples generated by cVAE are highly blurred, and cGAN suffers from mode collapse. Since samples generated by cVAE-GAN and our method appear to have similar quality, we refer to two metrics, Inception Score [36] and intra-class diversity [5], to compare them.
We adopt the Inception Score to evaluate the realism and inter-class diversity of images. Generated images $x$ that are close to real images of class $y$ should have a posterior probability $p(y|x)$ with low entropy. Meanwhile, images of diverse classes should have a marginal probability $p(y)$ with high entropy. Hence, the Inception Score, formulated as $\exp\left(\mathbb{E}_{x}\left[D_{KL}(p(y|x)\,\|\,p(y))\right]\right)$, takes a high value when images are realistic and diverse.
To obtain the conditional class probability $p(y|x)$, we first train a classifier with the Inception-ResNet-v1 [40] architecture on real data. Then we randomly generate 53k samples (100 for each class) of FaceScrub and 5k samples (500 for each class) of CIFAR-10, and feed them to the pre-trained classifier. The marginal $p(y)$ is obtained by averaging all $p(y|x)$. The results are listed in Table 1.
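The score computation described above can be sketched directly from the classifier outputs. The sketch assumes the per-image class probabilities are already available as plain lists; the function name is hypothetical, and the classifier itself is outside the scope of the example.

```python
import math

def inception_score(probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs is a list with one class-probability list per generated image.
    """
    num_images = len(probs)
    num_classes = len(probs[0])
    # Marginal p(y): average of the conditional probabilities over all images.
    marginal = [sum(p[j] for p in probs) / num_images for j in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0.0
        )
    return math.exp(kl_sum / num_images)

# Confident predictions spread evenly over two classes versus uninformative ones.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

The score is bounded above by the number of classes, reached when every image is classified with certainty and the classes are equally represented.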
We emphasize that our method generates more diverse samples within one class. Since the Inception Score only measures inter-class diversity, the intra-class diversity of samples should also be taken into account. We adopt the metric proposed in [5], which measures the average negative MS-SSIM [45] between all pairs in the generated image set. Table 2 shows the intra-class diversity of cVAE-GAN and our method on FaceScrub and CIFAR-10.
[TABLE]
4.3 Analysis on disentangled latent space
We now evaluate our proposal on the disentangled latent space, which is represented by the label relevant dimensions $z_s$ and the irrelevant ones $z_u$. $z_s$ for class $c$ is supposed to capture the variation unique to training images with the label $c$, while $z_u$ should contain the variation in characteristics common to all classes. This is validated in the following ways: (1) Fixing $z_u$ and varying $z_s$. In this setting, we directly sample a $z_u \sim \mathcal{N}(0, I)$ and keep it fixed. Then a set of $z_s$ for class $c$ is obtained by first drawing a series of random codes from $\mathcal{N}(0, I)$ and then mapping them to class $c$. Specifically, we first sample two random codes from $\mathcal{N}(0, I)$. Then a set of random codes is obtained by linear interpolation between them. We map each interpolated code to class $c$. Finally, each resulting $z_s$ is concatenated with the fixed $z_u$ and given to the decoder to get a generated image. (2) Fixing $z_s$ and varying $z_u$. Similar to (1), we first sample a $z_s$ from the learned distribution $\mathcal{N}(\mu_c, \Sigma_c)$ and keep it fixed. Then a set of label irrelevant codes $z_u$ is obtained by linearly interpolating between two codes sampled from $\mathcal{N}(0, I)$.
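The interpolation procedure in (1) can be sketched as follows. The mapping of a standard-normal code to a class is assumed here to be the usual shift and scale by the class mean and standard deviation; this is an illustrative assumption, as are the function names, the latent size of 4, and the class parameters.

```python
import random

def interpolate(start, end, steps):
    """Return `steps` codes moving linearly from start to end (steps >= 2)."""
    codes = []
    for i in range(steps):
        t = i / (steps - 1)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(start, end)])
    return codes

def map_to_class(eps, class_mean, class_std):
    """Assumed mapping: shift and scale a standard-normal code into a class Gaussian."""
    return [m + s * e for e, m, s in zip(eps, class_mean, class_std)]

rng = random.Random(0)
eps_a = [rng.gauss(0.0, 1.0) for _ in range(4)]
eps_b = [rng.gauss(0.0, 1.0) for _ in range(4)]
class_mean, class_std = [2.0] * 4, [0.5] * 4  # hypothetical class parameters
z_s_set = [map_to_class(e, class_mean, class_std) for e in interpolate(eps_a, eps_b, 5)]
print(len(z_s_set), len(z_s_set[0]))
```

Because the same interpolated codes can be mapped to several classes, each column of a figure can share one random code while each row uses the parameters of a different class.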
We conduct experiments on FaceScrub, and the generated images are shown in Figure 4. In Figure 4 (a), each row presents samples generated from linearly transformed $z_s$ of a certain class $c$ and a fixed $z_u$. All three rows share the same $z_u$, and each column shares the same random code, which is simply mapped to a different class. It shows that as $z_s$ varies, a face may change differently for different identities, e.g., grow a beard, become wrinkled, or take off make-up. In Figure 4 (b), each row presents samples generated from linearly transformed $z_u$ and a fixed $z_s$ of class $c$, and each column shares the same $z_u$. We can see that the images in each row change consistently in pose, expression and illumination. These two experiments suggest that $z_s$ is relevant to the label $y$, while $z_u$ reflects more common, label irrelevant characteristics.
We are also interested in each dimension of $z_u$ and conduct an experiment by varying a single element of it. We find three dimensions of $z_u$ that reflect meaningful common characteristics, such as expression, elevation and azimuth.
4.4 Semi-supervised image generation
Following the details in Section 3.6, we conduct experiments on semi-supervised image generation. We find that our method can learn a well-disentangled latent representation when unlabeled extra data are available. To validate this, we randomly select 200 identities with about 21k images from the CASIA [47] dataset and remove their labels to form the unlabeled dataset. Note that the identities in the unlabeled dataset are totally different from those in FaceScrub. After training the whole network on the labeled dataset, we finetune it on the unlabeled dataset using the training algorithm described in Section 3.6.
To demonstrate the semi-supervised generation results, two different images are given to the label relevant encoder and the label irrelevant encoder to generate the codes $z_s$ and $z_u$, respectively. Then the decoder is required to synthesize a new image from the concatenated code of $z_s$ and $z_u$. Figure 6 shows face synthesis results using images whose identities do not appear in the labeled dataset. The first row and first column show the sets of original images that provide the two codes, while the images in the middle are generated by combining the code from the corresponding row with the code from the corresponding column. It is evident that the identity depends on $z_s$, while other characteristics such as pose, illumination and expression are reflected in $z_u$. This semi-supervised generation shows that $z_s$ and $z_u$ can also be disentangled on identities outside the labeled training data, which provides great flexibility for image generation.
4.5 Image inpainting
Our method can also be applied to image inpainting: given a partly corrupted image, we can extract a meaningful latent code to reconstruct the original image. Note that in cVAE-GAN [3] an extra class label must be provided for reconstruction, while this is unnecessary in our method. In practice, we first corrupt some patches of an image $x$, namely the right-half, eyes, nose and mouth, and bottom-half regions, and then input the corrupted images into the two encoders to get $z_s$ and $z_u$; the reconstructed image is then generated from the combined $z_s$ and $z_u$. The image inpainting result is obtained by filling the corrupted patch, indicated by a binary mask, with the corresponding region of the reconstructed image while keeping the remaining pixels of the corrupted input. Figure 7 shows the results of image inpainting. cVAE-GAN struggles to complete the images when a large region is missing (e.g. the right-half and bottom-half parts) or when a pivotal region of the face is missing (e.g. the eyes), while our method provides visually pleasing results.
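The final compositing step with the binary mask can be sketched on flattened pixel lists. The convention that a mask value of 1 marks a corrupted pixel is an assumption made for this example, and the function name is hypothetical.

```python
def inpaint(corrupted, reconstructed, mask):
    """Blend the reconstruction into the corrupted image.

    All arguments are equal-length lists of pixel values; mask[i] == 1 marks
    a corrupted pixel to be taken from the reconstruction (assumed convention),
    and mask[i] == 0 keeps the original pixel.
    """
    return [
        m * r + (1 - m) * c
        for c, r, m in zip(corrupted, reconstructed, mask)
    ]

# Only the middle pixel is corrupted, so only it is replaced.
print(inpaint([0.2, 0.0, 0.7], [0.3, 0.9, 0.6], [0, 1, 0]))
```

Keeping the uncorrupted pixels untouched means the reconstruction quality only matters inside the masked region.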
5 Conclusion
We propose a latent space disentangling algorithm built on the VAE baseline. Our model learns two separate encoders and divides the latent code into label relevant and irrelevant dimensions. Together with a discriminator in the pixel domain, we show that our model can generate high-quality and diverse images, and that it can also be applied to semi-supervised image generation, in which unlabeled data with unseen classes are given to the encoders. Future research includes building more interpretable latent dimensions with the help of more labels, and reducing the correlation between the label relevant and irrelevant codes in our framework.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Project 61302125, and in part by the Natural Science Foundation of Shanghai under Project 17ZR1408500.
Appendix A Mathematical proofs
A.1 The ELBO of the log-likelihood objective
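We state in Equation 2 that after dividing the latent space into label relevant dimensions $z_s$ and label irrelevant dimensions $z_u$, the ELBO of the log-likelihood objective becomes three terms in our setting.
$$\log p(x) \ge \mathbb{E}_{q_\psi(z_s|x),\,q_\phi(z_u|x)}\left[\log p_\theta(x|z_s, z_u)\right] - D_{KL}\left(q_\phi(z_u|x)\,\|\,p(z_u)\right) - D_{KL}\left(q_\psi(z_s, y|x)\,\|\,p(z_s, y)\right)$$
Proof
Our generative process is described as follows. First, sample a label relevant code $z_s \sim p(z_s|y)$ and a label irrelevant code $z_u \sim p(z_u)$. Then a decoder $p_\theta(x|z_s, z_u)$, taking the combination of $z_s$ and $z_u$ as input, maps the latent codes to images. Hence, we factorize the joint distribution as:
$$p(x, z_s, z_u, y) = p_\theta(x|z_s, z_u)\,p(z_s|y)\,p(y)\,p(z_u)$$
By using Jensen's inequality, the log-likelihood can be bounded as:
$$\begin{aligned} \log p(x) &= \log \sum_{y} \iint p(x, z_s, z_u, y)\,dz_s\,dz_u \\ &\ge \mathbb{E}_{q_\psi(z_s, y|x)\,q_\phi(z_u|x)}\left[\log \frac{p_\theta(x|z_s, z_u)\,p(z_s, y)\,p(z_u)}{q_\psi(z_s, y|x)\,q_\phi(z_u|x)}\right] \\ &= \mathbb{E}_{q_\psi(z_s|x),\,q_\phi(z_u|x)}\left[\log p_\theta(x|z_s, z_u)\right] - D_{KL}\left(q_\phi(z_u|x)\,\|\,p(z_u)\right) - D_{KL}\left(q_\psi(z_s, y|x)\,\|\,p(z_s, y)\right) \end{aligned}$$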
A.2 Log-likelihood regularization term in the label relevant branch
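Note that the KL divergence $D_{KL}(q_\psi(z_s, y|x)\,\|\,p(z_s, y))$, the third term of the ELBO in Equation 2, is minimized. If we assume conditional independence between $z_s$ and the class $y$ given $x$, then we have
$$q_\psi(z_s, y|x) = q_\psi(z_s|x)\,q(y|x)$$
where $q(y|x)$ is the one-hot encoding of the label $y$. If $q_\psi(z_s|x)$ is formulated as a Gaussian distribution with $\Sigma \to 0$ and mean $\hat{z}_s$ output by the label relevant encoder, it is in fact a Dirac delta function:
$$q_\psi(z_s|x) = \delta(z_s - \hat{z}_s)$$
The KL regularization term becomes
$$D_{KL}\left(q_\psi(z_s, y|x)\,\|\,p(z_s, y)\right) = -\mathbb{E}_{q_\psi(z_s|x)}\left[\log p(z_s|y)\right] - \log p(y) + \mathbb{E}_{q_\psi(z_s|x)}\left[\log q_\psi(z_s|x)\right]$$
The second term relates to the prior distribution, so it can be regarded as a constant. The third term is the negative entropy of the delta function and does not depend on the encoder output, hence we consider it a constant too. Therefore, we have
$$D_{KL}\left(q_\psi(z_s, y|x)\,\|\,p(z_s, y)\right) = -\log p(\hat{z}_s|y) + \text{const}$$
where the prior distribution $p(z_s|y)$ is set to $\mathcal{N}(\mu_y, \Sigma_y)$. Ignoring the constant term, it turns out to be the likelihood regularization term in Equation 7.
$$L_{lkd} = -\log \mathcal{N}(\hat{z}_s; \mu_y, \Sigma_y)$$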
A.3 Cross-entropy objective in the label relevant branch
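To encourage $z_s$ to be as label relevant as possible, the mutual information $I(z_s; y)$ is maximized, where $z_s \sim q_\psi(z_s|x)$. In practice, $I(z_s; y)$ is hard to optimize directly because it requires access to the posterior $p(y|z_s)$. We can instead optimize its lower bound by introducing an auxiliary distribution $Q(y|z_s)$ to approximate $p(y|z_s)$, as in InfoGAN [9].
$$I(z_s; y) = H(y) - H(y|z_s) = H(y) + \mathbb{E}_{z_s}\left[\mathbb{E}_{y' \sim p(y|z_s)}\left[\log p(y'|z_s)\right]\right] \ge H(y) + \mathbb{E}_{z_s}\left[\mathbb{E}_{y' \sim p(y|z_s)}\left[\log Q(y'|z_s)\right]\right]$$
Since we still need to sample from $p(y|z_s)$ in the inner expectation, we adopt Lemma 5.1 in InfoGAN to remove the need for it. The first term of the lower bound, $H(y)$, is a constant, so we ignore it. Then the second term becomes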
[TABLE]
We assume that the process of sampling is independent of , thus
[TABLE]
According to Lemma 5.1 of InfoGAN, we have
[TABLE]
Hence
[TABLE]
We further factorize as , the equation above becomes
[TABLE]
where is the one-hot encoding of the label , i.e. . Maximizing it is equivalent to minimizing its negative, which is exactly the classification loss in Section 3.3.
[TABLE]
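A minimal numerical sketch of such a classification loss is given below, assuming the auxiliary distribution is a softmax classifier applied to the label relevant code. The function name `classification_loss` and its arguments are hypothetical and are used only for illustration.

```python
import numpy as np

def classification_loss(logits, labels):
    """Mean cross-entropy between one-hot labels and softmax(logits).

    logits: (N, C) scores of a hypothetical classifier on the label relevant code.
    labels: length-N sequence of integer class indices.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax: subtract the row-wise maximum first.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # One-hot encoding of the labels selects log Q(y | code) for the true class.
    one_hot = np.eye(logits.shape[1])[np.asarray(labels)]
    return float(-(one_hot * log_q).sum(axis=1).mean())
```

Minimizing this quantity maximizes the expected log-probability that the auxiliary classifier assigns to the true label.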
Appendix B Experimental details
B.1 Dataset synthesis of toy example
The synthetic dataset for our toy example is a modification of the two-moon dataset that contains three half circles instead of two. The generative process is as follows. First, sample data points from three half unit circles with a horizontal interval of 2.2. Then, add Gaussian noise with to all of them.
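The generative process can be sketched as follows. This is an illustrative reading, not the exact script used for the experiments: we assume that the interval of 2.2 is the distance between circle centres, that all three arcs are upper half circles, and that points are drawn uniformly in angle. The function name and the parameters `n_per_class` and `noise_std` are hypothetical, and the noise level must be supplied by the caller.

```python
import numpy as np

def make_three_half_circles(n_per_class, noise_std, interval=2.2, seed=0):
    """Sample three noisy half unit circles placed side by side.

    Assumptions (not confirmed by the text): centres are `interval` apart,
    every arc is an upper half circle, and angles are sampled uniformly.
    """
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k in range(3):
        theta = rng.uniform(0.0, np.pi, size=n_per_class)  # upper half only
        x = np.cos(theta) + k * interval                    # shift the k-th centre
        y = np.sin(theta)
        points.append(np.stack([x, y], axis=1))
        labels.append(np.full(n_per_class, k))
    X = np.concatenate(points)
    # Isotropic Gaussian noise added to every point.
    X = X + rng.normal(0.0, noise_std, size=X.shape)
    return X, np.concatenate(labels)
```

For example, `make_three_half_circles(n_per_class=500, noise_std=0.1)` returns 1500 two-dimensional points with labels 0, 1 and 2; the value 0.1 here is an arbitrary placeholder.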
B.2 Network architecture of FaceScrub
For the two encoders, and , we use the VGG [38] architecture with batch normalization added to each layer and replace the last three fc layers with two fc layers of 1024 and 512 units. For the decoder, an inverse structure of the encoders is applied. The adversarial classifier in Section 3.2 consists of two fc layers of 256 and 530 units, and the discriminator contains 7 convolution layers and two fc layers (details are shown in Table 4). Note that spectral normalization [29] is applied to all of the weights in the discriminator, and the label embedding is incorporated in the first fc layer as in [30].
B.3 Network architecture of Cifar-10
The network structures of the two encoders, the decoder and the discriminator for Cifar-10 are shown in Table 3. The adversarial classifier in the latent space is similar to that used for FaceScrub and consists of two fc layers of 256 and 10 units. Spectral normalization and label embedding are also applied in the discriminator.
B.4 Optimization
We use the Adam optimizer with , and . Because the first stage, using , is trained 3 times per second-stage iteration, converges fast. Continuing to train after it converges causes instability of because goes down gradually. In practice, we decay the learning rate of by 0.01 after 2 epochs.
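The update schedule described above can be sketched as a simple generator. Several points are assumptions: "decay by 0.01" is read as multiplying the first-stage learning rate by a factor of 0.01, the second-stage learning rate is kept fixed, and the base learning rate is left as a caller-supplied argument. All function and parameter names are hypothetical.

```python
def first_stage_lr(epoch, base_lr, decay_factor=0.01, decay_after_epochs=2):
    """Learning rate of the first stage (assumed multiplicative decay)."""
    if epoch >= decay_after_epochs:
        return base_lr * decay_factor
    return base_lr

def training_steps(num_epochs, second_stage_iters_per_epoch, base_lr,
                   first_per_second=3):
    """Yield (epoch, stage, lr) tuples in the order the updates would run."""
    for epoch in range(num_epochs):
        lr_first = first_stage_lr(epoch, base_lr)
        for _ in range(second_stage_iters_per_epoch):
            # Three first-stage updates for every second-stage update.
            for _ in range(first_per_second):
                yield (epoch, "first", lr_first)
            # Assumption: the second-stage learning rate is not decayed.
            yield (epoch, "second", base_lr)
```

In an actual training loop, each yielded tuple would correspond to one optimizer step on the parameters of the indicated stage.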
B.5 Inception Score
Recall that the Inception Score requires access to the conditional class probability . We use a classification model with the Inception-ResNet-v1 [40] architecture trained on VGGFace2 [8] to evaluate generative models trained on FaceScrub. For generative models trained on Cifar-10, a classification model with the Inception-v3 [41] architecture trained on ImageNet [35] is used.
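Given the conditional class probabilities produced by such a classifier, the score can be computed with the commonly used definition, the exponential of the average KL divergence between the conditional and the marginal class distributions. The sketch below assumes the probabilities are already available as an array; the function name `inception_score` and the smoothing constant `eps` are illustrative choices, and the splitting into subsets used in some evaluation protocols is omitted.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from an (N, C) array of class probabilities.

    Row i holds the classifier's conditional class distribution for image i.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution over all generated images.
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-image KL divergence between conditional and marginal distributions.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

The score equals 1 when every image receives the same class distribution and approaches the number of classes when predictions are confident and evenly spread over the classes.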
Appendix C Additional experiment results
C.1 More generated samples on FaceScrub and Cifar-10
Figure 8 shows generated samples of our method on FaceScrub and Cifar-10 with each row corresponding to a certain class.
C.2 Additional experiments on CUB-200-2011 and Cifar-100
We additionally apply our method to the CUB-200-2011 [43] and Cifar-100 [21] datasets. CUB-200-2011 contains 200 categories of birds with 11,788 images in total. For CUB-200-2011, we crop the images according to the bounding boxes provided by the dataset and resize the cropped images to 64 × 64. The network structure is the same as that used for FaceScrub. For Cifar-100, we use the same network as for Cifar-10. Generated images are shown in Figure 9. Results of and intra-class diversity are listed in Table 5 and Table 6, respectively.
References
- [1] M. E. Abbasnejad, A. Dick, and A. van den Hengel. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 781–790. IEEE, 2017.
- [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. stat, 1050:9, 2017.
- [3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2764–2773. IEEE, 2017.
- [4] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6713–6722, 2018.
- [5] M. Ben-Yosef and D. Weinshall. Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. arXiv preprint arXiv:1808.10356, 2018.
- [6] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7, 2017.
- [7] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- [8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pages 67–74. IEEE, 2018.
