Image Decomposition and Classification through a Generative Model

Houpu Yao; Malcolm Regan; Yezhou Yang; Yi Ren

arXiv:1902.03361·cs.CV·February 12, 2019

Image Decomposition and Classification through a Generative Model

Houpu Yao, Malcolm Regan, Yezhou Yang, Yi Ren

PDF

Open Access

TL;DR

This paper introduces a generative model that enhances image classification robustness by decomposing inputs into components, effectively handling adversarial attacks and distribution shifts, demonstrated on multiple datasets.

Contribution

A conditional variational autoencoder is proposed to simultaneously decompose images and classify components, improving robustness against adversarial and distributional challenges.

Findings

01

Effective decomposition of overlapping components in multiMNIST.

02

High robustness to gradient and non-gradient attacks on MNIST and NORB.

03

Successful recognition of novel component combinations in traffic signs.

Abstract

We demonstrate in this paper that a generative model can be designed to perform classification tasks under challenging settings, including adversarial attacks and input distribution shifts. Specifically, we propose a conditional variational autoencoder that learns both the decomposition of inputs and the distributions of the resulting components. During test, we jointly optimize the latent variables of the generator and the relaxed component labels to find the best match between the given input and the output of the generator. The model demonstrates promising performance at recognizing overlapping components from the multiMNIST dataset, and novel component combinations from a traffic sign dataset. Experiments also show that the proposed model achieves high robustness on MNIST and NORB datasets, in particular for high-strength gradient attacks and non-gradient attacks.

Figures14

Click any figure to enlarge with its caption.

Tables1

Table 1. Table 1 : Comparisons between baseline (binarization) and the proposed method on robustness under FGM attacks with increasing perturbation levels.

	$ϵ$	0.0	0.1	0.2	0.3	0.4
MNIST	baseline	0.97	0.95	0.90	0.76	0.51
MNIST	proposed	0.95	0.87	0.91	0.87	0.82
NORB	baseline	59.8	18.3	2.8	0.8	0.2
NORB	proposed	87.1	61.9	40.1	24.6	18.8

Equations6

E_{{x^{r}, y} \sim D} [∣∣ x^{f} (z, y, ϕ) - x^{r} ∣ ∣_{2}^{2} + C i \sum n y_{i}^{T} K L (z_{i})]

E_{{x^{r}, y} \sim D} [∣∣ x^{f} (z, y, ϕ) - x^{r} ∣ ∣_{2}^{2} + C i \sum n y_{i}^{T} K L (z_{i})]

z^{*}, y^{*} = argmin_{z \in R^{p}, y \in {0, 1}^{n}} f (z, y, x^{r}), f (z, y, x^{r}) \equiv ∥ x^{f} (z, y, ϕ) - x^{r} ∥_{2}^{2} .

z^{*}, y^{*} = argmin_{z \in R^{p}, y \in {0, 1}^{n}} f (z, y, x^{r}), f (z, y, x^{r}) \equiv ∥ x^{f} (z, y, ϕ) - x^{r} ∥_{2}^{2} .

z^{*}, y^{*} = argmin_{z \in R^{p}, y \in [0, 1]^{n}} F (z, y, x^{r}) F (z, y, x^{r}) = ∥ S i g (x^{f} (z, y, ϕ)) - S i g (x^{r}) ∥_{2}^{2} + c ∥ y ∥_{1},

z^{*}, y^{*} = argmin_{z \in R^{p}, y \in [0, 1]^{n}} F (z, y, x^{r}) F (z, y, x^{r}) = ∥ S i g (x^{f} (z, y, ϕ)) - S i g (x^{r}) ∥_{2}^{2} + c ∥ y ∥_{1},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning

MethodsSolana Customer Service Number +1-833-534-1729

Full text

Image Decomposition and Classification through a Generative Model

Abstract

We demonstrate in this paper that a generative model can be designed to perform classification tasks under challenging settings, including adversarial attacks and input distribution shifts. Specifically, we propose a conditional variational autoencoder that learns both the decomposition of inputs and the distributions of the resulting components. During test, we jointly optimize the latent variables of the generator and the relaxed component labels to find the best match between the given input and the output of the generator. The model demonstrates promising performance at recognizing overlapping components from the multiMNIST dataset, and novel component combinations from a traffic sign dataset. Experiments also show that the proposed model achieves high robustness on MNIST and NORB datasets, in particular for high-strength gradient attacks and non-gradient attacks.

**Index Terms— ** Generative model, classification, adversarial defense

1 Introduction

Neural network architectures have been developed to achieve human-level performance on standard vision tasks [1, 2]. However, it is acknowledged that feedforward networks have difficulty at generalization under input distribution shifts, e.g., novel object sets [3] and objects with overlaps [4]. Furthermore, studies have shown that networks, even with high standard test accuracy, can suffer from imperceptible adversarial attacks [5]. While neither distribution shifting or adversarial attacks are common cases in standard test environments for image classifiers [6, 7], the demonstrated risk of existing models have raised concerns over their real-world applications (e.g., autonomous driving and security surveillance) where failed classification for a short amount of time can be catastrophic. These concerns have led to the question of whether semantic attributes or physical components of the inputs are truly understood by feedforward, albeit deep, networks [1, 2]. Indeed, evidence showed that state-of-the-art classification models have a drastically different accuracy changing pattern than human beings in classifying image sequences with diminishing details [8], suggesting that the two have different feature learning behaviors.

This concern over model generalizability and robustness is intrinsic to classifiers that perform bottom-up signal processing. Alternative models that integrate bottom-up processing with top-down reasoning through recursive inference have been studied [9, 10]. Of particular interest are recursive compositional models (e.g., AND-OR templates [10]) that learn to match deformable objects and infer graph states for detection and recognition. These models take advantage of highly structured generators, e.g., by explicitly modeling object edges and surfaces. However, the intrinsic trade-off between model and computational complexity (for both model learning and recognition/classification) may hamper their application to general inputs for which the recognition or classification tasks depend on a richer set of features. With the rapid advance in generative models (e.g. GAN [11], PixelCNN [12], and VAE [13, 14]), it is tempting to investigate top-down classification mechanisms that incorporate more flexible generators than compositional models.

To this end, we present in this paper a classification algorithm where input images are classified by minimizing the difference between the target image and the output of a generator. We choose Conditional Variational Autoencoder (c-VAE) as the backbone of our model due to its simplicity and good convergence property. Following the argument that attribute recognition is key to object classification [15, 16], we focus on the classification of attributes, or more specifically, the existence of object components from the input image, while assuming that the follow-up mapping from these components to the object class is established. As illustrated in Fig. 1, the model is built on a variational autoencoder whose decoder is composed of separate sub-networks, where each component is associated with a sub-network. Experiments show that the proposed model demonstrates promising performance at recognizing overlapping components from the multiMNIST dataset, and novel component combinations from a traffic sign dataset. The model also achieves high robustness on MNIST and NORB datasets, in particular for high-strength gradient attacks and non-gradient attacks.

2 related work

Conditional generative models have been heavily investigated in the literature. For example, VAE structure [13] is extended into c-VAE [14] by including attribute variables as extra inputs. Besides, the latent variables are disentangled into two sub-networks in [14] to perform foreground and background separation. In our work, we further extended the c-VAE idea to have more sub-networks where each sub-network corresponds to a component. And instead of generation or segmentation, we investigate the possibility to use the conditional generative model to perform robust classification.

The utility of generative model as a defense mechanism has been explored very recently. For example, defense GAN [17] finds the closest generation from a GAN to a given input image, and feeds the generated image to a vanilla classifier for prediction. Similar to our approach, Schott et al. [18] chose VAE to perform image generation and used the reconstruction error as a metric to perform classification directly, without feedforward classifier involved. Compared with [18], the proposed model has an optimized network structure and classification workflow, which makes it able to learn object decomposition and handle more tasks like novel or overlapping object recognition.

3 Proposed Method

3.1 Customized conditional variational autoencoder

We customize a conditional variational autoencoder to learn the decomposition of object components. Let $x^{r}$ and $y$ be the input images and their corresponding components, respectively. A binary component vector $y$ encodes the existence of components within $x^{r}$ . It is assumed that the maximum amount of possible components in the dataset is $n$ . Let $z=Enc(x^{r},\theta)\in\mathbb{R}^{p}$ be the latent variable generated from the encoder $Enc(\cdot,\theta)$ with parameters $\theta$ , and $x^{f}(z,y,\phi)=\sum_{i}^{n}y_{i}Dec_{i}(z_{i},\phi_{i})$ be the generated images from the set of decoders $Dec_{i}(\cdot,\phi_{i})$ with parameters $\phi_{i}$ . $z_{i}\in\mathbb{R}^{p/n}$ is a segment of $z$ corresponding to the $i$ th component. Following the formulation of variational autoencoders, we define our training loss as:

[TABLE]

where $KL(z_{i})$ is the sub-network-wise KL divergence of the distribution of $z_{i}$ from a standard normal distribution. In the presented experiments, Adam optimizer with a learning rate of 0.001 is used for training.

3.2 Classification

Based on the learned generative model, the classification process can be cast into an inverse problem, where we jointly find $z^{*}$ and $y^{*}$ that minimize the difference between the generated image $x^{f}$ and the target image $x^{r}$ in the image space:

[TABLE]

Note that Eq. (2) is a combinatorial problem, since $y$ is a binary vector. To make this problem feasible, we relax component labels to be continuous within $[0,1]$ , so that the optimization problem becomes differentiable and can be solved efficiently through gradient descent and back-propagation. In addition, a $L_{1}$ penalty is introduced to regularize non-zero components with weight $c$ . The regularization is needed so that a component will be correctly discarded (i.e., sub-network outputs suppressed) if its addition to the generated image does not improve the reconstruction quality significantly. Lastly, it is also found that minimizing the image-wise difference after a sigmoid transformation over the reconstruction further improves the defense performance. With these modifications, the classifier solves the following problem:

[TABLE]

where $Sig(x)=\text{sigmoid}\big{(}\beta\cdot x+b\big{)}$ , and $\beta$ , $b$ are hyper-parameters.

Upon convergence, $z^{*}$ and $y^{*}$ obtained via Eq. (3) will be processed to derive the final classification result using the following logic flow: First, if the lowest reconstruction loss found is larger than a pre-set threshold $l$ , the input image will be classified as noise. This step is able to filter out adversarial attacks using non-gradient methods such as Neuroevolution of Augmenting Topologies (NEAT) [19]. After this initial filtering, we apply two thresholds $y_{l}$ and $y_{u}$ ( $y_{l}<y_{u}$ ) to $y^{*}$ : an element $y^{*}_{i}$ is set to [math] if $y^{*}_{i}<y_{l}$ , and 1 if $y^{*}_{i}>y_{u}$ . For the remaining elements of $y^{*}$ between $y_{l}$ and $y_{u}$ , we will enumerate over all binary combinations of this subset to generate a candidate set $\mathcal{Y}$ for fine-tuning the prediction. Specifically, we compute $z^{\dagger}(y)=\text{argmin}_{z\in\mathbb{R}^{p}}f(z,y,x^{r})$ for $y\in\mathcal{Y}$ and set the classification result as $y^{\dagger}=\text{argmin}_{y\in\mathcal{Y}}f(z^{\dagger}(y),y,x^{r})$ . All hyper-parameters ( $\beta$ , $b$ , $l$ , $y_{l}$ , and $y_{u}$ ) are determined based on a validation set, and are therefore dataset dependent.

3.3 Implementation details

The proposed model will be tested on MNIST, NORB [20], multiMNIST [4], and a Traffic Sign (TS) dataset. For NORB, similar to [4], we downscale the image resolution to 48-by-48. The multiMNIST dataset is synthesized by stacking two MNIST images into an image of resolution 36-by-36, which will result in 80% overlap between two digits on average. Every image label in multiMNIST has 2 ones and 8 zeros. A visualization of the multiMNIST dataset can be found in Fig. 4. The TS dataset is prepared with affine transformation to mimic traffic signs at different view angles. There are four different types of images in the dataset: “left turn”, “right turn”, “no left turn”, and “no right turn”. These images are in a resolution of 64-by-64, with examples visualized in Fig. 3. There are four components for TS images to represent the components: circle, slash, left arrow, and right arrow. For example, a “No right turn” image has label $[1,1,0,1]$ since it has circle, slash, and right arrow.

The number of sub-networks is equal to the number of total components in the dataset, which is 10 for MNIST and multiMNIST, 5 for NORB, and 4 for TS. Once the network has been trained, it is expected to decompose the input object into different components, as illustrated in Fig.1. To demonstrate the efficacy of the learning of decomposition, we visualize the latent space of each sub-networks trained on different datasets in Fig.2. One observation from the results is that the model is able to remove data redundancy automatically, which can be seen from the TS dataset: To recall, the first component of a TS image represents the existence of a circle. Since the circle exists in all TS images, the learned generative model combines the circle into the generation of the arrows and leaves the first sub-network blank.

4 Results and Discussion

In this section, we demonstrate the efficacy of our method through a set of challenging experiments. These include classifying overlapping and novel objects, and defense against both gradient-based (FGM [5]) and non-gradient based (NEAT [19]) adversarial attacks.

4.1 Classification of novel and overlapping objects

We first train our model on the TS dataset by only using “left turn”, “right turn” and “no right turn” images for training. After the model is trained, its performance is evaluated on “No left turn” images. This experiment is set up in a way that the network is asked to recognize novel objects not included in the training set by making use of the learned decomposition of image elements.

A traditional feedforward CNN classifier will completely fail under this setting, where 99% of the “no left turn” images are classified as “left turn”. This is due to the fact that the classifier will associate the “left arrow” feature with the “left turn” category during the training phase, and this correlation strongly affects the prediction during the testing phase. In contrast, our proposed model learns to decompose the TS dataset into components after training (as shown in Fig. 2b), and by combining these components together, it correctly recognizes 95% of all novel testing images. The classification result is shown in Fig. 3.

We now demonstrate that the proposed model is able to handle objects with large overlap using the multiMNIST dataset. After training, the proposed model learns the decomposition of individual digits successfully as shown in Fig.2a. Applying the learned model to classifying the test dataset leads to a classification accuracy of 65.6%. Successful and failed test samples and their classification results are shown in Fig.4. While the accuracy on multiMNIST is lower than that of a CapsNet [4](95%), investigation of the model performance shows that many of the misclassified samples are truly difficult to be separated even for human beings. It should also be noted that the accuracy of CapsNet is resulted from 60 million training data, while our model successfully decomposes the learns the component-wise latent spaces with only 128k data points. We expect improved classification accuracy by increasing the capacity of the generative model.

4.2 Defending adversarial attacks

We test the performance of our model under both gradient- and non-gradient-based adversarial attacks in this experiment. FGM and NEAT are chosen as the attack methods. NEAT performs evolution on Compositional Pattern Producing Networks (CPPN) [21], each of which represents an image. The fitness of the evolution is defined as $1-||\hat{y}-y_{target}||$ , where $\hat{y}$ is the classification result of a CPPN image by the source classification model and $y_{target}$ is the target class for the attack. If an adversarial image is misclassified with over $90\%$ confidence, the attack is considered successful and this adversarial image will be used for test.

We first compare our method with the binarization defense [22] under FGM attack in Tab.1. The performance of binarization drops quickly with increasing attack magnitudes, while the proposed model is able to maintain high accuracy even under relatively high attack magnitude. To further examine the cases where our model fails, we visualized adversarial MNIST samples with $\epsilon=0.4$ in Fig.5a, along with correctly classified samples. We note that many of the misclassified images are hard to be recognized even for human beings.

The comparison on NEAT attacks is shown in Fig.5b. The baseline classifier is fooled completely by the adversarial images with consistently high confidence for the targeted labels (shown in the first row of Fig.5b). Even after binarization, many NEAT images can still be misclassified with high confidence (shown in the second row of Fig.5b). In contrast, the proposed method is able to identify 90.6% of these attacks. This is because the generative model is not trained on these highly structured CPPN patterns, and is not able to reconstruct them during the test. Due to the high reconstruction error, the proposed model will conclude that these input images are out of the data distribution.

4.3 Discussion

While the presented experiments show promise of the proposed classification method, a deeper investigation is needed to better characterize the applicability and limitations of the proposed method to more complicated datasets (e.g., CIFAR10 and ImageNet) under more sophisticated settings (e.g. PGD adversarial attack [23]). Further improvements can be made to compress the generative model in order to accelerate the optimization during classification. Generative models with the capability to create fine-grained details should be studied and incorporated to further improve the performance.

5 Conclusions

In this paper, we investigate the utility of a tailored conditional variational autoencoder as a classifier, and test its generalizability and robustness under challenging tasks. Results show that the proposed training and classification formulations lead to promising performance: First, the model can recognize overlapping objects and novel component combinations that do not exist in the training phase. Second, our model is able to defend against adversarial attacks well, in particular under higher attack magnitudes and under none-gradient attacks.

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , 2016, pp. 770–778.
2[2] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2016, pp. 2818–2826.
3[3] Simon M Stringer and Edmund T Rolls, “Invariant object recognition in the visual system with novel views of 3d objects,” Neural Computation , vol. 14, no. 11, pp. 2585–2596, 2002.
4[4] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems , 2017, pp. 3859–3869.
5[5] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” ar Xiv preprint ar Xiv:1412.6572 , 2014.
6[6] Nicolas Papernot, Patrick Mc Daniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami, “Practical black-box attacks against deep learning systems using adversarial examples,” ar Xiv preprint ar Xiv:1602.02697 , 2016.
7[7] Jiajun Lu, Hussein Sibai, Evan Fabry, and David Forsyth, “Standard detectors aren’t (currently) fooled by physical adversarial stop signs,” ar Xiv preprint ar Xiv:1710.03337 , 2017.
8[8] Shimon Ullman, Liav Assif, Ethan Fetaya, and Daniel Harari, “Atoms of recognition in human and computer vision,” Proceedings of the National Academy of Sciences , vol. 113, no. 10, pp. 2744–2749, 2016.