Multi-Adversarial Variational Autoencoder Networks

Abdullah-Al-Zubaer Imran; Demetri Terzopoulos

arXiv:1906.06430·cs.LG·June 18, 2019

TL;DR

MAVENs are a novel VAE-GAN architecture with an ensemble of discriminators, designed to improve image generation and semi-supervised classification; they are evaluated on the SVHN, CIFAR-10, and Chest X-Ray datasets.

Contribution

Introduces MAVENs, a new network architecture with multiple discriminators for improved semi-supervised image generation and classification.

Findings

1. Competitive performance on the CIFAR-10, SVHN, and Chest X-Ray datasets.

2. Effective in both image synthesis and semi-supervised classification.

3. Achieves better FID scores on CIFAR-10 than some recently published models.

Tables

Table 1: Minimum FID and DDD scores achieved by the DC-GAN, VAE-GAN, and MAVEN models for the CIFAR-10, SVHN, and CXR datasets.

CIFAR-10			SVHN			CXR
Model	FID	DDD	Model	FID	DDD	Model	FID	DDD
DC-GAN	61.293 $\pm$ 0.209	0.265	DC-GAN	16.789 $\pm$ 0.303	0.343	DC-GAN	152.511 $\pm$ 0.370	0.145
VAE-GAN	15.511 $\pm$ 0.125	0.224	VAE-GAN	13.252 $\pm$ 0.001	0.329	VAE-GAN	141.422 $\pm$ 0.580	0.107
MAVEN-mean2D	12.743 $\pm$ 0.242	0.223	MAVEN-mean2D	11.675 $\pm$ 0.001	0.309	MAVEN-mean2D	141.339 $\pm$ 0.420	0.138
MAVEN-mean3D	11.316 $\pm$ 0.808	0.190	MAVEN-mean3D	11.515 $\pm$ 0.065	0.300	MAVEN-mean3D	140.865 $\pm$ 0.983	0.018
MAVEN-mean5D	12.123 $\pm$ 0.140	0.207	MAVEN-mean5D	10.909 $\pm$ 0.001	0.294	MAVEN-mean5D	147.316 $\pm$ 1.169	0.100
MAVEN-rand2D	12.820 $\pm$ 0.584	0.194	MAVEN-rand2D	11.384 $\pm$ 0.001	0.316	MAVEN-rand2D	154.501 $\pm$ 0.345	0.038
MAVEN-rand3D	12.620 $\pm$ 0.001	0.202	MAVEN-rand3D	10.791 $\pm$ 0.029	0.357	MAVEN-rand3D	158.749 $\pm$ 0.297	0.179
MAVEN-rand5D	18.509 $\pm$ 0.001	0.215	MAVEN-rand5D	11.052 $\pm$ 0.751	0.323	MAVEN-rand5D	152.778 $\pm$ 1.254	0.180
Dropout-GAN[21]	88.60 $\pm$ 0.08	-
TTUR[12]	36.9	-
Coulomb GANs[32]	27.300	-
AIQN[26]	49.500	-
SN-GAN[20]	21.700	-
Learned Moments[28]	18.9	-

Table 2: Average cross-validation accuracy and class-wise F1 scores for the semi-supervised classification performance comparison of the DC-GAN, VAE-GAN, and MAVEN models using the SVHN dataset.

Model	Accuracy	F1 scores
		0	1	2	3	4	5	6	7	8	9
DC-GAN	0.876	0.860	0.920	0.890	0.840	0.890	0.870	0.830	0.890	0.820	0.840
VAE-GAN	0.901	0.900	0.940	0.930	0.860	0.920	0.900	0.860	0.910	0.840	0.850
MAVEN-mean2D	0.909	0.890	0.930	0.940	0.890	0.930	0.900	0.870	0.910	0.870	0.890
MAVEN-mean3D	0.909	0.910	0.940	0.940	0.870	0.920	0.890	0.870	0.920	0.870	0.860
MAVEN-mean5D	0.905	0.910	0.930	0.930	0.870	0.930	0.900	0.860	0.910	0.860	0.870
MAVEN-rand2D	0.905	0.910	0.930	0.940	0.870	0.930	0.890	0.860	0.920	0.850	0.860
MAVEN-rand3D	0.907	0.890	0.910	0.920	0.870	0.900	0.870	0.860	0.900	0.870	0.890
MAVEN-rand5D	0.903	0.910	0.930	0.940	0.860	0.910	0.890	0.870	0.920	0.850	0.870

Table 3: Average cross-validation accuracy and class-wise F1 scores for the semi-supervised classification performance comparison of the DC-GAN, VAE-GAN, and MAVEN models using the CIFAR-10 dataset.

Model	Accuracy	F1 scores
		airplane	automobile	bird	cat	deer	dog	frog	horse	ship	truck
DC-GAN	0.713	0.760	0.840	0.560	0.510	0.660	0.590	0.780	0.780	0.810	0.810
VAE-GAN	0.743	0.770	0.850	0.640	0.560	0.690	0.620	0.820	0.770	0.860	0.830
MAVEN-mean2D	0.761	0.800	0.860	0.650	0.590	0.750	0.680	0.810	0.780	0.850	0.850
MAVEN-mean3D	0.759	0.770	0.860	0.670	0.580	0.700	0.690	0.800	0.810	0.870	0.830
MAVEN-mean5D	0.771	0.800	0.860	0.650	0.610	0.710	0.640	0.810	0.790	0.880	0.820
MAVEN-rand2D	0.757	0.780	0.860	0.650	0.530	0.720	0.650	0.810	0.800	0.870	0.860
MAVEN-rand3D	0.756	0.780	0.860	0.640	0.580	0.720	0.650	0.830	0.800	0.870	0.830
MAVEN-rand5D	0.762	0.810	0.850	0.680	0.600	0.720	0.660	0.840	0.800	0.850	0.820

Table 4: Average cross-validation accuracy and class-wise F1 scores for the semi-supervised classification performance comparison of the DC-GAN, VAE-GAN, and MAVEN models using the CXR dataset.

Model	Accuracy	F1 scores
		Normal	B-Pneumonia	V-Pneumonia
DC-GAN	0.461	0.300	0.520	0.480
VAE-GAN	0.467	0.220	0.640	0.300
MAVEN-mean2D	0.469	0.310	0.620	0.260
MAVEN-mean3D	0.525	0.640	0.480	0.480
MAVEN-mean5D	0.477	0.380	0.480	0.540
MAVEN-rand2D	0.478	0.280	0.630	0.310
MAVEN-rand3D	0.506	0.440	0.630	0.220
MAVEN-rand5D	0.483	0.170	0.640	0.240

Equations

$$\max_{D} V(D)$$

$$\min_{G} V(G)$$

$$V(D) = \frac{1}{K} \sum_{i=1}^{K} w_{i} D_{i}$$

$$\nabla_{\theta_{D_{k}}} \frac{1}{m} \sum_{i=1}^{m} \left[\log D_{k}(x_{i}) + \log(1 - D_{k}(G(z_{i})))\right]$$

$$D_{\mu} = \frac{1}{K} \sum_{i}^{K} w_{i} D_{i}$$

$$\nabla_{\theta_{G}} \frac{1}{m} \sum_{i=1}^{m} \left[\log(1 - D_{\mu}(G(z_{i})))\right]$$

$$\nabla_{\theta_{E_{q_{\lambda}(z \mid x)}}} \left[\log \frac{p(z)}{q_{\lambda}(z \mid x)}\right]$$

$$p(y = n+1 \mid x) = \frac{\exp(l_{n+1})}{\sum_{j=1}^{n+1} \exp(l_{j})},$$

$$p(y = i \mid x, i < n+1) = \frac{\exp(l_{i})}{\sum_{j=1}^{n+1} \exp(l_{j})}.$$

$$L_{D_{\text{supervised}}} = -\mathbb{E}_{x, y \sim p_{\text{data}}} \log[p(y = i \mid x, i < n+1)].$$

$$L_{D_{\text{real}}} = -\mathbb{E}_{x \sim p_{\text{data}}} \log[1 - p(y = n+1 \mid x)],$$

$$L_{D_{\text{fake1}}} = -\mathbb{E}_{\hat{x} \sim G} \log[p(y = n+1 \mid \hat{x})],$$

$$L_{D_{\text{fake2}}} = -\mathbb{E}_{\tilde{x} \sim G} \log[p(y = n+1 \mid \tilde{x})],$$

$$L_{D_{\text{unsupervised}}} = L_{D_{\text{real}}} + L_{D_{\text{fake1}}} + L_{D_{\text{fake2}}}.$$

$$L_{G_{\text{feature}}} = \left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{\hat{x} \sim G} f(\hat{x}) \right\|_{2}^{2}.$$

$$L_{G} = L_{G_{\text{feature}}} + L_{G_{\text{fake1}}} + L_{G_{\text{fake2}}}.$$

$$L_{G_{\text{fake1}}} = -\mathbb{E}_{\hat{x} \sim G} \log[1 - p(y = n+1 \mid \hat{x})],$$

$$L_{G_{\text{fake2}}} = -\mathbb{E}_{\tilde{x} \sim G} \log[1 - p(y = n+1 \mid \tilde{x})].$$

$$L_{E} = L_{E_{\text{KL}}} + L_{E_{\text{feature}}},$$

$$L_{E_{\text{KL}}} = -KL[q_{\lambda}(z \mid x) \,\|\, p(z)] = \mathbb{E}_{q_{\lambda}(z \mid x)} \left[\log \frac{p(z)}{q_{\lambda}(z \mid x)}\right] \approx \mathbb{E}_{q_{\lambda}(z \mid x)}$$

$$L_{E_{\text{feature}}} = \left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{\tilde{x} \sim G} f(\tilde{x}) \right\|_{2}^{2}.$$

$$FID = \left\| \mu_{\text{data}} - \mu_{\text{fake}} \right\|^{2} + Tr\left(\Sigma_{\text{data}} + \Sigma_{\text{fake}} - 2 (\Sigma_{\text{data}} \Sigma_{\text{fake}})^{1/2}\right).$$

$$DDD = -\sum_{i=1}^{4} \log w_{i} \left| \mu_{\text{data}_{i}} - \mu_{\text{fake}_{i}} \right|.$$

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

$$\text{precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{recall} = \frac{TP}{TP + FN},$$

Full text

Multi-Adversarial Variational Autoencoder Networks

Abdullah-Al-Zubaer Imran, Demetri Terzopoulos

Computer Science Department

University of California, Los Angeles

Abstract

The unsupervised training of GANs and VAEs has enabled them to generate realistic images mimicking real-world distributions and perform image-based unsupervised clustering or semi-supervised classification. Combining the power of these two generative models, we introduce Multi-Adversarial Variational autoEncoder Networks (MAVENs), a novel network architecture that incorporates an ensemble of discriminators in a VAE-GAN network, with simultaneous adversarial learning and variational inference. We apply MAVENs to the generation of synthetic images and propose a new distribution measure to quantify the quality of the generated images. Our experimental results using datasets from the computer vision and medical imaging domains—Street View House Numbers, CIFAR-10, and Chest X-Ray datasets—demonstrate competitive performance against state-of-the-art semi-supervised models both in image generation and classification tasks.

1 Introduction

Training deep neural networks usually requires a large pool of labeled data, yet obtaining large datasets for tasks such as image classification remains a fundamental challenge. Although there has been explosive progress in the production of vast quantities of high resolution images, large collections of labeled data required for supervised learning remain scarce. Especially in domains such as medical imaging, datasets are limited in size due to privacy issues, and manual annotation by medical experts is expensive, time-consuming, and prone to subjectivity, human error, and variance across different experts. Even when large labeled datasets become available, they are often highly imbalanced and nonuniformly distributed. For instance, in an imbalanced medical dataset there will be an over-representation of common medical problems and an under-representation of rare conditions. Such biases make the training of neural networks across multiple classes with similar effectiveness very challenging.

The small-training-data problem is traditionally mitigated through simplistic and cumbersome data augmentation, often by creating new training examples through translation, rotation, flipping, etc. The missing or mismatched label problem can be addressed by evaluating similarity measures over the training examples. However, this approach is not always robust, and its effectiveness depends largely on the performance of the similarity-measuring algorithms.

Generative models, such as VAEs [16] and GANs [9], have recently become popular because of their ability to learn underlying data distributions from training samples. This has made generative models more practical in the increasingly common scenarios where unlabeled data are abundant. With minimal annotation, an efficient semi-supervised learning model could be a go-to approach. More specifically, based on small quantities of annotation, generative models could be utilized to learn real-data distributions and synthesize realistic new training images. Both VAEs and GANs can be employed for this purpose.

VAEs can learn a dimensionality-reduced representation of the training data and, with explicit density estimation, can generate new samples. However, VAE-generated samples are usually blurry (Fig. 1b). On the other hand, despite their successes in image generation and semi-supervised classification, GAN frameworks remain very difficult to train; the challenges include non-convergence due to unstable training, mode-collapsed image generation (Fig. 1c), diminished gradients, overfitting, and high sensitivity to hyper-parameters.

To stabilize GAN training and combat mode collapse, several variants have been proposed. Nguyen et al. [24] proposed a model in which a single generator is used alongside dual discriminators. Durugkar et al. [7] proposed a model with a single generator and feedback aggregated over several discriminators, using either the average loss over all discriminators or only the discriminator with the maximum loss in relation to the generator’s output. Neyshabur et al. [23] proposed a framework where a single generator simultaneously trains against an array of discriminators, each of which operates on a different low-dimensional projection of the data. Mordido et al. [21], arguing that all the previous approaches restrict the discriminator’s architecture, which compromises the extensibility of the framework, instead proposed Dropout-GAN, where a single generator is trained against a dynamically changing ensemble of discriminators. However, this carries a risk of dropping out all the discriminators. Feature matching and minibatch discrimination techniques have been proposed [30] for mitigating mode collapse and preventing overfitting in GAN training.

Although there have been wide ranging efforts in high quality image generation with GANs and VAEs, accuracy and image quality are usually not ensured in the same model, especially in multi-class image classification. To tackle this issue, we propose a novel method that can learn joint image generation and multi-class image classification. Our specific contribution is the Multi-Adversarial Variational autoEncoder Network, or MAVEN, a novel multi-class image classification model incorporating an ensemble of discriminators in a combined VAE-GAN network. An ensemble layer combines the feedback from multiple discriminators at the end of each batch. With the inclusion of ensemble learning at the end of a VAE-GAN, both generated image quality and classification accuracy are improved simultaneously. We also introduce a simplified version of the Descriptive Distribution Distance (DDD) measure for evaluating any generative model, which better represents the distribution of the generated data and quantifies its closeness to the real data. Our experimental results on a number of different datasets in both the computer vision and medical imaging domains indicate that our MAVEN model improves upon the joint image generation and classification performance of a GAN and a VAE-GAN with the same set of hyper-parameters.

2 Related Work

Generative modeling has attracted much attention in the computer vision and medical imaging research communities. In particular, realistic image generation greatly helps address many problems involving the scarcity of labeled data. GANs and their variants have been applied in different architectures in continuing efforts to improve the accuracy and effectiveness of image classification. The GAN framework has been utilized in numerous works as a more generic approach to generating realistic training images that synthetically augment datasets in order to combat overfitting; e.g., for synthetic data augmentation in liver lesions [8], retinal fundi [10], histopathology [13], and chest X-rays [29]. Calimeri et al. [4] employed a LAPGAN [6] and Han et al. [11] used a WGAN [2] to generate synthetic brain MR images. Bermudez et al. [3] used a DCGAN [27] to generate 2D brain MR images followed by an autoencoder for image denoising. Chuquicusma et al. [5] utilized a DCGAN to generate lung nodules and then conducted a Turing test to evaluate the quality of the generated samples. GAN frameworks were also shown to improve the accuracy of image classification via the generation of new synthetic training images. Frid-Adar et al. [8] used a DCGAN and an ACGAN [25] to generate images of three liver lesion classes to synthetically augment the limited dataset and improve the performance of a CNN for liver lesion classification. Similarly, Salehinejad et al. [29] employed a DCGAN to artificially simulate pathology across five classes of chest X-rays in order to augment the original imbalanced dataset and improve the performance of a CNN model in chest pathology classification.

The GAN framework has also been utilized in semi-supervised learning architectures to help leverage the vast amount of unlabeled data alongside limited labeled data. The following efforts demonstrate how incorporating unlabeled data in the GAN framework has led to significant improvements in the accuracy of image-level classification: Madani et al. [18] used an order of magnitude less labeled data with a DCGAN in semi-supervised learning and showed comparable performance to a traditional supervised CNN classifier. Furthermore, their study also demonstrated reduced domain over-fitting by simply supplying unlabeled test domain images. Springenberg [31] combined a WGAN and CatGAN [33] for unsupervised and semi-supervised learning of feature representation of dermoscopy images.

Despite these successes, GAN frameworks are very difficult to train, as was discussed in the previous section. Our work mitigates the limitations of training the GAN framework; it enables training on a limited amount of labeled data, prevents overfitting to a specific data domain source, prevents mode collapse, and enables multi-class image classification.

3 MAVEN Architecture

Fig. 2 illustrates the preliminary models building up to our MAVEN architecture.

The VAE is an explicit generative model that uses two neural nets: an encoder $E$ and a decoder $D^{\prime}$. Network $E$ learns an efficient compression of the real data point $x$ into a lower dimensional latent representation space $z(x)$; i.e., $q_{\lambda}(z|x)$. With neural network likelihoods, computing the gradient becomes intractable. However, via differentiable, non-centered re-parameterization, sampling is performed from an approximate function $q_{\lambda}(z|x)=N(z;\mu_{\lambda},\sigma_{\lambda}^{2})$, where $z=\mu_{\lambda}+\sigma_{\lambda}\odot\hat{\varepsilon}$ with $\hat{\varepsilon}\sim N(0,1)$. Encoder $E$ outputs $\mu$ and $\sigma$, and with the re-parameterization trick, $z$ is sampled from a Gaussian distribution. Then, with $D^{\prime}$, new samples are generated or real data samples are reconstructed. Thus, $D^{\prime}$ provides the parameters of the real data distribution; i.e., $p_{\phi}(x|z)$. Later, a sample drawn from $p_{\phi}(x|z)$ may be used to reconstruct the real data by marginalizing out $z$.
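The re-parameterization step described above can be illustrated with a minimal sketch. The function name `reparameterize` and the numeric values of `mu` and `sigma` are hypothetical; in the model they would be outputs of the encoder $E$, which is not reproduced here.

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)      # fixed seed so the sketch is reproducible
mu = [0.5, -1.0, 2.0]       # hypothetical encoder means
sigma = [0.1, 0.2, 0.0]     # hypothetical encoder standard deviations
z = reparameterize(mu, sigma, rng)
```

Because the randomness enters only through `eps`, `z` is a deterministic, differentiable function of `mu` and `sigma`, which is what allows gradients to flow back to the encoder.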

The GAN is an implicit generative model where a generator $G$ and a discriminator $D$ compete in a mini-max game over the training data to improve their performance. Generator $G$ tries to mimic the underlying distribution of the training data and generates fake samples while discriminator $D$ learns to discriminate fake generated samples from real samples. The GAN model is trained on the following objectives:

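$$\max_{D} V(D) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{g}(z)}[\log(1 - D(G(z)))],$$

$$\min_{G} V(G) = \mathbb{E}_{z \sim p_{g}(z)}[\log(1 - D(G(z)))].$$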

$G$ takes a noise sample $z\sim p_{g}(z)$ and learns to map it into image space as if it came from the original data distribution $p_{\text{data}}(x)$. The discriminator $D$ takes either real image data or fake image data as input and provides feedback to the generator $G$ regarding whether the input to $D$ is real or fake. $D$ wants to maximize the likelihood for real samples and minimize the likelihood of generated samples. On the other hand, $G$ wants $D$ to maximize the likelihood of generated samples. A Nash equilibrium state is possible when $D$ can no longer distinguish real from generated samples, meaning that the model distribution is the same as the data distribution.

Makhzani et al. [19] proposed the adversarial training of VAEs; i.e., VAE-GANs. Although they kept both $D^{\prime}$ and $G$, one can merge $D^{\prime}$ and $G$ since both can generate data samples from the noise samples of the representation $z$. In this case, $D$ receives reconstructed samples $\tilde{x}$ (generated by $G$ from the encoded representation), fake samples $\hat{x}$ (generated by $G$ from noise), and real data samples $x$. Although $G$ and $D$ compete against each other, at some point the feedback from $D$ becomes predictable for $G$ and it keeps generating samples from the same class. At that point, the generated samples lack variety. Fig. 1c shows an example where all the generated images are of the same class. Durugkar et al. [7] proposed that using multiple discriminators in a GAN model helps improve performance, especially in resolving the mode collapse issue. Moreover, a dynamic ensemble of multiple discriminators has recently been proposed to address the same issue [21].

In our MAVEN, the VAE-GAN combination is extended to have multiple discriminators aggregated in an ensemble layer. As in a VAE-GAN, the MAVEN has three components $E$, $G$, and $D$; all are convolutional neural networks with convolutional or transposed convolutional layers (Fig. 3). $E$ takes real samples and generates a dimensionality-reduced representation $z(x)$. $G$ can take samples from the noise distribution $z\sim p_{g}(z)$ or sampled noise $z(x)\sim q_{\lambda}(z|x)$, and it generates fake or completely new samples. $D$ takes inputs from distributions of real labeled data, real unlabeled data, and fake generated data. Fractionally strided convolutions are performed in $G$ to obtain the image dimension from the latent code. The goal of an autoencoder is to maximize the Evidence Lower Bound (ELBO). The intuition here is to show the network more real data. The more real data that it sees, the more evidence is available to it and, as a result, the ELBO can be maximized faster. $K$ discriminators are collected in an ensemble layer and the combined feedback

$$V(D) = \frac{1}{K} \sum_{i=1}^{K} w_{i} D_{i}$$

is passed to $G$. Alternatively, in order to randomize the feedback from the multiple discriminators, the feedback of a single randomly selected discriminator is passed to $G$.
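The two feedback modes of the ensemble layer (mean and random selection) can be sketched as follows. This is only an illustration: the function names are hypothetical, the discriminator outputs are made-up numbers, and uniform weights $w_i = 1$ are an assumption, since the text does not specify how the weights are chosen.

```python
import random

def mean_feedback(scores, weights=None):
    """Weighted mean of K discriminator outputs: (1/K) * sum_i w_i * D_i."""
    k = len(scores)
    if weights is None:
        weights = [1.0] * k  # assumption: uniform weights
    return sum(w * d for w, d in zip(weights, scores)) / k

def random_feedback(scores, rng):
    """Feedback of a single discriminator chosen uniformly at random."""
    return scores[rng.randrange(len(scores))]

scores = [0.2, 0.4, 0.9]  # hypothetical outputs D_i(G(z)) for K = 3 discriminators
combined = mean_feedback(scores)
picked = random_feedback(scores, random.Random(0))
```

With these values the mean feedback is 0.5, whereas the random mode returns one of the three individual scores on each call.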

4 Semi-Supervised Learning

The overall training procedure of the proposed MAVEN model is presented in Algorithm 1. In the forward pass, the real samples to $E$ and the noise samples to $G$ are presented multiple times, once for each of the multiple discriminators. In the backward pass, the combined feedback from the discriminators is determined and passed to $G$ and $E$.

In the original image-generating GAN, $D$ works as a binary classifier: it classifies the input image as real or synthetic. In order to facilitate the training of an $n$-class classifier, the role of $D$ is changed to that of an $(n+1)$-class classifier. To generate multiple logits, the sigmoid function is replaced by a softmax function. $D$ now receives an image $x$ as input and outputs an $(n+1)$-dimensional vector of logits $\{{l}_{1},{l}_{2},\dots,{l}_{n+1}\}$. These logits are finally transformed into class probabilities for the final classification. Class ${(n+1)}$ is for the fake data and the remaining $n$ classes are for the multiple labels in the real data. The probability of $x$ being fake is

$$p(y = n+1 \mid x) = \frac{\exp(l_{n+1})}{\sum_{j=1}^{n+1} \exp(l_{j})},$$

and the probability that $x$ is real and belongs to class $i$ is

$$p(y = i \mid x, i < n+1) = \frac{\exp(l_{i})}{\sum_{j=1}^{n+1} \exp(l_{j})}.$$
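As a small illustration of the $(n+1)$-way softmax described above, the sketch below converts a logit vector into class probabilities and splits off the fake-class probability. The function name and the logit values are hypothetical, and subtracting the maximum logit is a standard numerical-stability step that is not mentioned in the text.

```python
import math

def class_probabilities(logits):
    """Softmax over n + 1 logits; the last entry corresponds to the fake class."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.0]   # hypothetical logits: n = 3 real classes + 1 fake class
probs = class_probabilities(logits)
p_fake = probs[-1]               # p(y = n + 1 | x)
p_real_classes = probs[:-1]      # p(y = i | x, i < n + 1)
```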

As a semi-supervised classifier, the model only takes labels for a small portion of training data. For the labeled data, it is then like supervised learning, while it learns in an unsupervised manner for the unlabeled data. The advantage comes from generating new samples. The model learns the classifier by generating samples from different classes.

4.1 Losses

Three networks $E$ , $G$ , and $D$ are trained on different objectives. $E$ is trained on maximizing the ELBO, $G$ is trained on generating realistic samples, and $D$ is trained to learn a classifier that classifies fake generated samples or particular classes for the real data samples.

D Loss:

Since the model is trained on both labeled and unlabeled training data, the loss function of $D$ includes both supervised and unsupervised losses. When the model receives real labeled data, it is just the standard supervised learning loss

$$L_{D_{\text{supervised}}} = -\mathbb{E}_{x, y \sim p_{\text{data}}} \log[p(y = i \mid x, i < n+1)].$$

When it receives unlabeled data from three different sources, the unsupervised loss contains the original GAN loss for real and fake data from two different sources: fake1 directly from $G$ and fake2 from $E$ via $G$ . The three losses

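$$L_{D_{\text{real}}} = -\mathbb{E}_{x \sim p_{\text{data}}} \log[1 - p(y = n+1 \mid x)],$$

$$L_{D_{\text{fake1}}} = -\mathbb{E}_{\hat{x} \sim G} \log[p(y = n+1 \mid \hat{x})],$$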

and

$$L_{D_{\text{fake2}}} = -\mathbb{E}_{\tilde{x} \sim G} \log[p(y = n+1 \mid \tilde{x})],$$

are combined as the unsupervised loss in $D$ :

$$L_{D_{\text{unsupervised}}} = L_{D_{\text{real}}} + L_{D_{\text{fake1}}} + L_{D_{\text{fake2}}}.$$
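The discriminator losses can be sketched as batch averages over softmax outputs. This is a minimal illustration under stated assumptions: each row is a probability vector with the fake class in the last position, expectations are replaced by batch means, and all function names and numbers are hypothetical.

```python
import math

def supervised_loss(probs_batch, labels):
    """Mean negative log-probability of the true class for labeled real samples."""
    return -sum(math.log(p[y]) for p, y in zip(probs_batch, labels)) / len(labels)

def real_loss(probs_batch):
    """Mean of -log(1 - p(fake | x)) over real unlabeled samples."""
    return -sum(math.log(1.0 - p[-1]) for p in probs_batch) / len(probs_batch)

def fake_loss(probs_batch):
    """Mean of -log p(fake | x) over generated samples."""
    return -sum(math.log(p[-1]) for p in probs_batch) / len(probs_batch)

def unsupervised_loss(real, fake1, fake2):
    """Sum of the real, fake1, and fake2 terms."""
    return real_loss(real) + fake_loss(fake1) + fake_loss(fake2)

# Hypothetical probability vectors for n = 2 real classes plus the fake class.
labeled = [[0.6, 0.3, 0.1]]
real = [[0.6, 0.3, 0.1]]
fake1 = [[0.1, 0.1, 0.8]]   # generated by G from noise
fake2 = [[0.2, 0.3, 0.5]]   # generated by G from the encoder output
d_supervised = supervised_loss(labeled, [0])
d_unsupervised = unsupervised_loss(real, fake1, fake2)
```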

G Loss:

For $G$ , the feature loss is used along with the original GAN loss. Activation $f(x)$ from an intermediate layer of $D$ is used to match the feature between real and fake samples. Feature matching has shown a lot of potential in semi-supervised learning [30]. The goal of feature matching is to push the generator to generate data that matches real data statistics. The discriminator specifies those statistics; it is natural that $D$ can find the most discriminative features in real data against data generated by the model:

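$$L_{G_{\text{feature}}} = \left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{\hat{x} \sim G} f(\hat{x}) \right\|_{2}^{2}.$$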

The total $G$ loss combines the feature loss with the $G$ costs that maximize the log-probability of $D$ making a mistake on generated data (fake1/fake2). Therefore, the $G$ loss

$$L_{G} = L_{G_{\text{feature}}} + L_{G_{\text{fake1}}} + L_{G_{\text{fake2}}}$$

is the combination of three losses: the feature loss $L_{G_{\text{feature}}}$ given above,

$$L_{G_{\text{fake1}}} = -\mathbb{E}_{\hat{x} \sim G} \log[1 - p(y = n+1 \mid \hat{x})],$$

and

$$L_{G_{\text{fake2}}} = -\mathbb{E}_{\tilde{x} \sim G} \log[1 - p(y = n+1 \mid \tilde{x})].$$
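The generator loss can be sketched in the same way. The feature vectors stand in for activations $f(\cdot)$ of an intermediate layer of $D$; here they are made-up numbers, expectations are replaced by batch means, and all names are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a batch of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def feature_matching_loss(real_feats, fake_feats):
    """Squared L2 distance between batch-mean features of real and generated samples."""
    mr, mf = mean_vector(real_feats), mean_vector(fake_feats)
    return sum((a - b) ** 2 for a, b in zip(mr, mf))

def generator_fake_loss(p_fake_batch):
    """Mean of -log(1 - p(fake | x)) over generated samples."""
    return -sum(math.log(1.0 - p) for p in p_fake_batch) / len(p_fake_batch)

def generator_loss(real_feats, fake1_feats, p_fake1, p_fake2):
    """Feature loss plus the fake1 and fake2 adversarial terms."""
    return (feature_matching_loss(real_feats, fake1_feats)
            + generator_fake_loss(p_fake1)
            + generator_fake_loss(p_fake2))

real_feats = [[1.0, 2.0], [3.0, 4.0]]    # hypothetical f(x) for real samples
fake1_feats = [[0.0, 1.0], [2.0, 1.0]]   # hypothetical f(x_hat) for fake1 samples
g_loss = generator_loss(real_feats, fake1_feats, [0.5], [0.5])
```

In this toy example the batch-mean features are (2, 3) and (1, 1), so the feature loss is 5, and each adversarial term contributes $\log 2$.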

E Loss:

In the encoder $E$ , the maximization of ELBO is equivalent to minimization of KL-divergence, allowing approximate posterior inferences. Therefore the loss function includes the KL-divergence and also a feature loss to match the features in the fake2 data with the real data distribution. The loss for the encoder is

$$L_{E} = L_{E_{\text{KL}}} + L_{E_{\text{feature}}},$$

where

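$$L_{E_{\text{KL}}} = -KL[q_{\lambda}(z \mid x) \,\|\, p(z)] = \mathbb{E}_{q_{\lambda}(z \mid x)} \left[\log \frac{p(z)}{q_{\lambda}(z \mid x)}\right] \approx \mathbb{E}_{q_{\lambda}(z \mid x)}$$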

and

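$$L_{E_{\text{feature}}} = \left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{\tilde{x} \sim G} f(\tilde{x}) \right\|_{2}^{2}.$$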
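For intuition about the KL term, the sketch below evaluates the KL divergence in closed form for a diagonal Gaussian posterior. It assumes a standard normal prior $p(z)=N(0,I)$, which the text does not state explicitly, and the function name is hypothetical; the sign with which this quantity enters $L_{E_{\text{KL}}}$ follows the equation above.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form.

    Assumes a standard normal prior, which is an assumption of this sketch.
    """
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])   # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0], [1.0])          # unit shift of the mean
```

The divergence vanishes when the posterior matches the prior and grows as the encoder's mean and variance move away from it.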

5 Experiments and Results

5.1 Data

We used three datasets to evaluate our MAVEN model for image generation and automatic image classification from 2D images in a semi-supervised learning scheme, and we constrained the experiments to limited labeled training data, considering that a large portion of annotation is missing; specifically:

1. The Street View House Numbers (SVHN) dataset [22]. There are 73,257 digit images for training and 26,032 digit images for testing in the SVHN dataset. Out of two versions of the images, we used the version which has MNIST-like $32\times 32$ pixel images centered around a single character, in RGB channels. Each of the training and test images is labeled as one of the ten digits (0–9).

2. The CIFAR-10 dataset [17], which consists of 60,000 $32\times 32$ pixel color images in 10 classes. There are 50,000 training images and 10,000 test images in the CIFAR-10 dataset. This is a 10-class classification with classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

3. The anterior-posterior Chest X-Ray (CXR) dataset [15] for the classification of pneumonia and normal images. We performed 3-class classification: normal, bacterial pneumonia, and virus pneumonia. The dataset contains 5,216 training and 624 test images.

5.2 Implementation Details

To compare the image generation and multi-class classification performance of our MAVEN model, we used two baselines: DC-GAN and VAE-GAN. The same generator and discriminator architectures were used for the DC-GAN and MAVEN models, and the same encoder was used for the VAE-GAN and MAVEN models. For our MAVENs, we experimented with 2, 3, and 5 discriminators. In addition to using the proposed mean feedback of the multiple discriminators, we also experimented with feedback from a randomly selected discriminator. All the models were implemented in TensorFlow and run on a single Nvidia Titan GTX (12GB) GPU. For the CXR dataset, the images were normalized and resized to $128\times 128$ pixels before passing them to the models, while for the SVHN and CIFAR-10 datasets, the normalized images were passed to the models in their original $(32\times 32\times 3)$ pixel sizes. For the discriminator, a dropout layer with a dropout rate of 0.4 was added after every convolutional layer. For all the models, we consistently used the Adam optimizer with a learning rate of $2\times 10^{-4}$ for $G$ and $D$, and $1\times 10^{-5}$ for $E$, with a momentum of 0.5. All the convolutional layers were followed by batch normalization. Leaky ReLU activations were used with $\alpha=0.2$. For all the experiments, only 10% of the training data were used with their corresponding labels. The classification performance was measured with cross-validation, and average scores were reported after running each model 10 times.

5.3 Evaluation

Image Generation Performance:

No performance metric is perfect for measuring the quality of generated samples in unsupervised learning. However, to assess the quality of the generated images, we employed the widely used Fréchet Inception Distance (FID) [12] and a simplified version of the Descriptive Distribution Distance (DDD) [14]. The FID measures the Fréchet distance between two multivariate Gaussians fitted to the generated samples and the real data samples, which are compared through their distribution statistics:

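$$FID = \left\| \mu_{\text{data}} - \mu_{\text{fake}} \right\|^{2} + Tr\left(\Sigma_{\text{data}} + \Sigma_{\text{fake}} - 2 (\Sigma_{\text{data}} \Sigma_{\text{fake}})^{1/2}\right).$$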

The statistics of the two distributions are calculated from the 2048-dimensional activations of the pool3 layer of Inception-v3 [30]. DDD measures the closeness of a synthetic data distribution to a real data distribution by comparing descriptive parameters from the two distributions. We propose a simplified version based on the first four moments of the distributions, computed as the weighted sum of normalized differences of moments, as follows:

$$DDD = -\sum_{i=1}^{4} \log w_{i} \left| \mu_{\text{data}_{i}} - \mu_{\text{fake}_{i}} \right|.$$

The higher-order moments are weighted more, as the stability of a distribution can be better represented by them. For both FID and DDD, lower scores are better.
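The Fréchet distance used in the FID can be sketched with NumPy as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the function names are hypothetical, random vectors stand in for Inception-v3 pool3 activations, and the trace of the matrix square root is computed from the eigenvalues of the covariance product.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues of S1 S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_activations(real_acts, fake_acts):
    """Fit a Gaussian to each activation set (rows = samples) and compare them."""
    mu_r, mu_f = real_acts.mean(axis=0), fake_acts.mean(axis=0)
    sigma_r = np.cov(real_acts, rowvar=False)
    sigma_f = np.cov(fake_acts, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)

rng = np.random.default_rng(0)
real_acts = rng.normal(size=(500, 4))            # stand-in for real-image activations
fake_acts = rng.normal(loc=0.5, size=(500, 4))   # stand-in for generated-image activations
fid = fid_from_activations(real_acts, fake_acts)
```

Clipping the eigenvalues at zero guards against small negative values caused by floating-point error; identical activation sets give a distance of approximately zero.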

Image Classification Performance:

To evaluate model performance in classification, we used two measures: image-level classification accuracy and class-wise F1 scoring. The F1 score is

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

with

$$\text{precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{recall} = \frac{TP}{TP + FN},$$

where TP, FP, and FN are the number of true positives, false positives, and false negatives, respectively.
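A class-wise F1 score can be computed directly from these definitions, as in the sketch below. The labels are made up for illustration, the function names are hypothetical, and returning 0 when a denominator is zero is a convention assumed here rather than one stated in the text.

```python
def class_f1(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0      # assumed convention for 0/0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical predictions
f1_scores = [class_f1(y_true, y_pred, c) for c in range(3)]
acc = accuracy(y_true, y_pred)
```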

5.4 Results

5.4.1 SVHN

For the SVHN dataset, we trained the network on $32\times 32$ pixel images. From the training set, we randomly picked 7,326 labeled images, and the remaining images were passed to the network without labels. All the models were trained for 150 epochs and then evaluated. We generated as many new images as there are images in the training set. Fig. 5 presents a qualitative comparison of the generated digit images from the DC-GAN, VAE-GAN, and MAVEN models relative to the real training images, suggesting that our MAVEN-generated images are more realistic.

This was further confirmed by the FID and DDD scores. The FID and DDD measurements were performed by drawing 10,000 samples from the generated images and 10,000 samples from the real training images. The generated image quality was measured for eight different models, and the resulting FID and DDD scores are reported in Table 1. For each model, the FID score is reported after running the pre-trained Inception-v3 network for 20 epochs. Per the scores, the MAVEN-rand model with 3 discriminators achieved the best FID, and the MAVEN-mean model with 5 discriminators achieved the best DDD.

For the semi-supervised classification, both image-level accuracy and class-wise F1 scores were calculated. Table 2 compares the classification performance of all the models for the SVHN dataset. The MAVEN model consistently outperformed the DC-GAN and VAE-GAN classifiers both in classification accuracy and class-wise F1 scores. Among all the models, our MAVEN-mean models with 2 and 3 discriminators were found to be the most accurate.

5.4.2 CIFAR-10

For the CIFAR-10 dataset, all the models were trained for 300 epochs and then evaluated. We generated as many new images as there are images in the training set. Fig. 6 visually compares the generated images from the GAN, VAE-GAN, and MAVEN models relative to the real training images.

The FID and DDD measurements were performed on 10,000 samples drawn from the generated images and 10,000 samples drawn from the real training images. For the FID score calculation, the pre-trained Inception-v3 network was run for 20 epochs and the FID score was recorded. The FID and DDD scores are reported in Table 1. As the tabulated results suggest, our proposed MAVEN models achieved better FID scores than some of the recently published models. Note, however, that those models were implemented in different settings. As with the visual comparison, the FID and DDD scores confirmed more realistic image generation with our MAVEN models than with the DC-GAN and VAE-GAN models. Except for MAVEN-mean with 2 discriminators, all other MAVEN models have smaller FID scores; MAVEN-rand with 3 discriminators has the smallest FID score among all the models.

For the semi-supervised classification, both image-level accuracy and class-wise F1 scores were calculated. Table 3 compares the performance of all the models for the CIFAR-10 dataset.

5.4.3 CXR

For the CXR dataset, all the models were trained for 150 epochs and then evaluated. We generated as many new images as there are images in the training set. Fig. 7 presents a visual comparison of synthesized and real image samples.

The FID and DDD measurements were performed on the distributions of the generated and real training samples, indicating that more realistic images were generated by the MAVEN models than by the GAN and VAE-GAN models. The FID and DDD scores presented in Table 1 show that the MAVEN-mean model with 3 discriminators (MAVEN-mean3D) has the smallest FID and DDD scores.

The classification performance reported in Table 4 suggests that our proposed MAVEN model-based classifiers are more accurate than the baseline GAN and VAE-GAN classifiers. Among all the models, the MAVEN-mean classifier with 3 discriminators was found to be the most accurate in distinguishing B-pneumonia and V-pneumonia from normal. However, the overall performance on the CXR dataset is weaker than on the natural image datasets. A possible reason is the shortage of data and the omission of a large portion of the labels. The main issue with the medical image dataset is that, unlike natural images, every case differs from the others, even when they are labeled as the same class. It may be possible to resolve this by augmenting the training set with the generated images from each of the models. However, the goal of our present work was to devise a generative model architecture that is equally competitive as a generator and as a classifier. Even with the relatively small dataset, the proposed MAVEN models perform better than the baseline models.

6 Conclusions

We have demonstrated the advantages of an ensemble of discriminators in the adversarial learning of variational autoencoders and the application of this idea to semi-supervised classification from limited labeled data. Training our new MAVEN models on a small, labeled dataset and leveraging a large number of unlabeled examples, we have shown superior performance relative to prior GAN and VAE-GAN based classifiers, suggesting that our MAVEN models can be very effective in concurrently generating high-quality realistic images and improving multi-class classification performance. However, it remains an open problem to find the optimal number of discriminators that can perform consistently. Our future work will consider more complex image analysis tasks beyond classification and include more extensive experimentation spanning additional domains.

Appendix A Comparison of Distributions

Through histogram-density diagrams, Fig. 4 compares the distributions of each of the models against the real distribution, showing that the distributions of images synthesized by our MAVENs are generally closer to the real image distributions for the SVHN, CIFAR-10, and CXR datasets.

Appendix B Comparison of Images

Figs. 5, 6, and 7 present visual comparisons of image samples from the SVHN, CIFAR-10, and CXR datasets, respectively, relative to those generated by the different models.

Bibliography

Arjovsky et al. [2017] Arjovsky, M., Chintala, S. and Bottou, L. [2017], ‘Wasserstein GAN’, arXiv preprint arXiv:1701.07875.
Bermudez et al. [2018] Bermudez, C., Plassard, A. J., Davis, L. T., Newton, A. T., Resnick, S. M. and Landman, B. A. [2018], Learning implicit brain MRI manifolds with deep learning, in ‘Medical Imaging 2018: Image Processing’, Vol. 10574, International Society for Optics and Photonics, p. 105741L.
Calimeri et al. [2017] Calimeri, F., Marzullo, A., Stamile, C. and Terracina, G. [2017], Biomedical data augmentation using generative adversarial neural networks, in ‘International Conference on Artificial Neural Networks’, Springer, pp. 626–634.
Chuquicusma et al. [2018] Chuquicusma, M. J., Hussein, S., Burt, J. and Bagci, U. [2018], How to fool radiologists with generative adversarial networks? A visual Turing test for lung cancer diagnosis, in ‘Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on’, IEEE, pp. 240–244.
Denton et al. [2015] Denton, E. L., Chintala, S., Szlam, A. and Fergus, R. [2015], Deep generative image models using a Laplacian pyramid of adversarial networks, in ‘NIPS’.
Durugkar et al. [2016] Durugkar, I., Gemp, I. and Mahadevan, S. [2016], ‘Generative multi-adversarial networks’, arXiv preprint arXiv:1611.01673.
Frid-Adar et al. [2018] Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J. and Greenspan, H. [2018], ‘GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification’, arXiv preprint arXiv:1803.01229.