Understanding Adversarial Behavior of DNNs by Disentangling Non-Robust and Robust Components in Performance Metric
Yujun Shi, Benben Liao, Guangyong Chen, Yun Liu, Ming-Ming Cheng, Jiashi Feng

TL;DR
This paper introduces a metric that separates the generalization performance of DNNs into robust and non-robust components, revealing how current models rely on non-robust features and how adversarial training enhances robustness.
Contribution
The work proposes a novel information-theoretic metric to disentangle robust and non-robust components influencing DNN performance, providing insights into adversarial vulnerability and robustness.
Findings
Current DNNs depend heavily on non-robust features for performance.
Adversarial training suppresses reliance on non-robust components.
The metric offers a new perspective on balancing accuracy and robustness.
Table 1: Accuracy (%) of models trained with TRADES [32] on CIFAR-10, evaluated on standard samples and on adversarial samples produced by the PGD attack [17] and the CW attack [3].

| Metric | VGG13 (pure linear) | VGG13 (half linear) | VGG13 (normal) | resnet20x1 | resnet32x10 |
|---|---|---|---|---|---|
| Accuracy (standard) | 35.14 | 70.48 | 72.36 | 72.88 | 79.45 |
| Accuracy (PGD attack) | 17.06 | 38.52 | 42.29 | 44.94 | 49.52 |
| Accuracy (CW attack) | 13.94 | 34.12 | 38.96 | 41.49 | 47.67 |
Yujun Shi (Nankai University), Benben Liao (Tencent), Guangyong Chen (Tencent), Yun Liu (Nankai University), Ming-Ming Cheng (Nankai University), Jiashi Feng (National University of Singapore)
Work done when Yujun Shi was interning at Tencent. Benben Liao made an equal contribution to this work. Corresponding author.
Abstract
The vulnerability to slight input perturbations is a worrying yet intriguing property of deep neural networks (DNNs). Although many previous works have studied the reasons behind such adversarial behavior, the relationship between the generalization performance and the adversarial behavior of DNNs is still poorly understood. In this work, we reveal this relationship by introducing a metric that characterizes the generalization performance of a DNN. The metric can be disentangled into an information-theoretic non-robust component, responsible for adversarial behavior, and a robust component. We then show by experiments that current DNNs rely heavily on optimizing the non-robust component to achieve decent performance. We also demonstrate that current state-of-the-art adversarial training algorithms robustify DNNs by preventing them from using the non-robust component to distinguish samples from different categories. Based on these findings, we take a step forward and point out a possible direction for achieving decent standard performance and adversarial robustness simultaneously. We believe that our theory could inspire the community to make further discoveries about the relationship between standard generalization and adversarial generalization of deep learning models.
1 Introduction
Deep neural networks (DNNs) have achieved enormous success in many different tasks [14] over the last decade. A major line of work tries to boost the performance of deep learning models from different aspects [12, 25, 27, 9, 8, 11]. While the community is devoted to achieving new state-of-the-art performance with DNNs, some researchers [28] have found that these powerful models are susceptible to perturbations that are imperceptible even to humans. Consequently, a solid body of work has developed on both finding the most effective attacks [7, 19, 13, 33] and obtaining more adversarially robust models [7, 18, 17, 32]. Specifically, the authors of [33] provide a theoretical derivation of how the Fisher information of the model’s output w.r.t. the input characterizes the adversarial behavior around that input.
Efforts to make models adversarially robust [18, 17] have also revealed a non-trivial degradation in standard performance as the cost of robustness, which has attracted much attention recently. Several works [24, 29, 26, 20, 10] try to understand this apparent trade-off between standard generalization and adversarial generalization. For example, the authors of [10] offer the novel viewpoint that adversarial samples are non-robust features that can help deep learning models generalize, and they validate this conjecture experimentally.
In this work, by contrast, we theoretically reveal the relationship between the standard performance and the adversarial behavior of deep learning models using the Taylor expansion of the Kullback–Leibler divergence (KL divergence) and Fisher information. Interestingly, we show that the overall performance objective can be disentangled into a non-robust component, which has the side effect of causing adversarial behavior, and a robust component. Our analysis shows that adversarial robustness and high standard performance are indeed in tension, because the non-robust component, which contributes to standard performance, also causes adversarial behavior.
Accuracy on the test set is usually used to measure the performance of machine learning models on classification tasks. In this work, to investigate the adversarial behavior of DNNs, we propose to transform this performance metric into the KL divergence between the output distributions of test set samples from different categories. This new objective not only characterizes how well the model distinguishes samples from different categories, but also tightly connects model performance with adversarial behavior. The resulting theory shows that this objective separates into the non-robust and robust components described above.
We also demonstrate by experiments that current deep learning models rely heavily on the non-robust component to generalize, which is the underlying reason for adversarial behavior in trained deep learning models. In addition, we show that state-of-the-art adversarial training algorithms all try to keep the model from using the non-robust component. Based on these findings, we suggest that there might exist a balance point at which deep learning models possess decent standard generalization ability while staying adversarially robust.
Our contribution is summarized as follows.
- We propose a new metric that both characterizes the standard performance and better connects with the adversarial behavior of DNNs.
- By properly expanding our metric with Fisher information, we quantitatively explain the relationship between standard performance and adversarial robustness of DNNs.
- We then take a step forward and point out a possible direction for achieving decent standard performance and adversarial robustness simultaneously.
2 Related Work
Adversarial Attacks and Defense
To study the adversarial behavior of deep learning models, many algorithms have been proposed recently, both for finding the most effective adversarial perturbation and for improving the adversarial robustness of deep learning systems. [7] proposes a one-step, gradient-based attack algorithm called the fast gradient sign method (FGSM), [21] proposes an attack algorithm based on the Jacobian saliency map, and [19] proposes an attack algorithm based on Newton’s iterative algorithm for finding roots of a non-linear function. [22] proposes a distillation-based training mechanism for defense, while [3] proposes the "CW attack", which renders the distillation defense mechanism useless. [13] proposes an iterative version of the FGSM attack, and [17] suggests that adversarial training based on the projected gradient descent attack has universal defense effects. [18] proposes the distribution smoothing training strategy for defense in both supervised and semi-supervised learning settings. [33] studies adversarial attacks under the Fisher information metric and uses the power method to solve for eigenvectors of the Fisher information matrix, where the eigenvector of the greatest eigenvalue is treated as the attack noise.
Fisher Information in Deep Learning
[4] uses the Fisher information matrix as a metric that induces a distance on the parameter manifold and develops a regularization term for incremental learning. The concept of Fisher information is also widely applied in deep learning algorithms based on natural gradient descent [23, 6], as well as in meta-learning [2, 1]. These works usually view the parameters of the model as the parameters of the Fisher information matrix. Recently, [18, 33] propose to use Fisher information to represent the local geometry of the log-likelihood landscape of the deep learning model so that adversarial behavior can be better studied. These two works, however, view the input of the deep learning model as the parameter of the Fisher information instead. We adopt the same definition of Fisher information as [18, 33] to better study the relationship between the standard performance and the adversarial robustness of the model.
Adversarial Behavior of DNNs
Previously, a line of work has tried to explain the robust model’s degradation in standard performance [20, 26, 29, 24, 32]. These works all start by studying the accuracy of the model, sometimes under relatively strong assumptions. The recent work [10] also shows empirically, by disentangling non-robust and robust features, that adversarial samples are features that help the standard generalization of deep learning models. In our work, we show theoretically, and with few assumptions, that what causes adversarial behavior also helps boost the model’s performance. Our theory, however, does not indicate that achieving decent performance and being adversarially robust are two completely contradictory objectives. It is still possible that a balance point exists where both objectives can be satisfied simultaneously.
3 Disentanglement of the Performance Metric
We present our main results in this section. We first propose to transform the standard performance metric based on test accuracy into one based on KL divergence. Then, with the newly formed objective, we show that the performance of a DNN model is determined by two disentangled components, and that one of them gives rise to the model’s adversarial behavior. We conclude this section with some complementary understanding from the viewpoint of information geometry.
3.1 Proposed performance metric
Prediction accuracy on the test set is typically adopted in the literature to measure the performance of a DNN model. However, this metric hides the connection between model performance and adversarial behavior. In this work, in order to build a more transparent connection and better understand the model’s adversarial behavior, we propose to adopt the average KL divergence between the output distributions of any pair of data from different categories as the classification performance metric.
We denote as the input image and as the corresponding one-hot label distribution, as the model and as the output distribution of the model, as the number of pairs of input data with different labels, as the Jensen-Shannon divergence between and . We propose to adopt the objective as Cross Category KL Divergence (CCKL):
[TABLE]
By Lin’s inequality [15] and the triangle inequality, we can derive the following lower bound, which describes the relation between the widely used cross entropy loss and our proposed objective (1):
[TABLE]
where denotes the Jensen-Shannon divergence between and . This lower bound of effectively characterizes its behavior over the course of training: as the training error decreases, the whole lower bound increases.
We can also view this from another perspective. When randomly initialized, a DNN has no knowledge for classifying samples correctly. Therefore, it does not distinguish different inputs , and the output distributions should be similar, as shown in Figure 1. That is to say, should be relatively small at the very beginning of training. As training proceeds, more label-dependent information is integrated into the model . The network generalizes better and the output distribution on the test set gets closer to the true label distribution . At the late training stage, the average model loss decreases to a relatively small value. By continuity of the KL divergence, will be sufficiently close to , as shown in Figure 1.
The proposed measure thus better characterizes how well the model distinguishes the input data from data of other categories. To further illustrate this point, we visualize the relationship in Figure 2.
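The behavior described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes the metric is the plain average of KL divergences between the output distributions of all ordered pairs with different labels, and the function names (`cckl`, `kl_divergence`) and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two probability vectors; eps guards against log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def cckl(probs, labels):
    # Average KL(p_i || p_j) over all ordered pairs (i, j) with different labels.
    total, count = 0.0, 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] != labels[j]:
                total += kl_divergence(probs[i], probs[j])
                count += 1
    return total / count if count else 0.0

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])

# Untrained-like model: almost identical, near-uniform outputs for every input.
untrained = softmax(0.01 * rng.standard_normal((4, 2)))
# Trained-like model: confident outputs that agree with the labels.
trained = softmax(np.array([[4.0, -4.0], [4.0, -4.0], [-4.0, 4.0], [-4.0, 4.0]]))

cckl_untrained = cckl(untrained, labels)
cckl_trained = cckl(trained, labels)
```

In this toy setting the value is close to zero for the untrained-like outputs and large for the trained-like outputs, mirroring the qualitative picture given for the start and end of training.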
3.2 Connections Between Adversarial Behavior and CCKL
We now explain how the above CCKL objective could be connected to the model’s adversarial behavior. In most previous literature [7, 13, 17], the following cross entropy loss of the model on adversarial samples (with perturbation )
[TABLE]
is adopted to study the adversarial behavior of DNNs. As training proceeds and the parameter approaches the optimum, the output distribution will be close to the true label . In this situation, since the KL divergence is continuous in its first argument, we have
[TABLE]
This observation inspires us to study the model adversarial behavior from a distribution point of view, as explored in [18, 33, 32]. Instead of viewing the output of as a single scalar, we treat the model as a function that outputs a prediction distribution over the input. Thus we use the KL divergence between the output distribution over the original samples and adversarial samples to characterize the model’s adversarial behavior.
In order to better connect adversarial behavior with the CCKL proposed before, we define the following adversarial measure built upon the above relation:
[TABLE]
The objective of the corresponding adversarial training is thus formulated as follows:
[TABLE]
Based on the distribution point of view, [33] reports state-of-the-art results in the task of adversarial attack and [32] reports the state-of-the-art results in adversarial robustness. Therefore, using (3) instead of the cross entropy loss on adversarial samples to represent adversarial behavior is further justified.
Given the definition of , applying Taylor expansion yields the following:
[TABLE]
where is the Fisher information of w.r.t. . Let be the -th entry of and be the number of entries of . Then can be calculated by:
[TABLE]
When is sufficiently small, higher order terms in the above would vanish and (5) could be simplified into:
[TABLE]
By setting , we obtain , where is the maximum eigenvalue of . Therefore, the solution to the overall adversarial objective corresponds to the leading eigenvector of . Consequently, we have:
[TABLE]
Note that here is also the spectral norm of the Fisher information matrix . The above derivation shows that the local adversarial behavior of the model around input is determined by the spectral norm of Fisher information matrix: the adversarial behavior around would be more severe if the spectral norm of is larger.
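The chain from the KL divergence to the Fisher information and its leading eigenvector can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors’ code: it uses a hypothetical linear softmax classifier with weight matrix `W`, chosen because its input Fisher information has the closed form `W.T @ (diag(p) - p p^T) @ W`, and it measures the perturbation budget in the L2 norm.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) for strictly positive probability vectors.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def input_fisher(W, x):
    # For p = softmax(W x): grad_x log p_i = W^T (e_i - p), hence
    # G_x = sum_i p_i grad_i grad_i^T = W^T (diag(p) - p p^T) W.
    p = softmax(W @ x)
    return W.T @ (np.diag(p) - np.outer(p, p)) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))   # hypothetical 3-class linear classifier
x = rng.standard_normal(5)
G = input_fisher(W, x)

# Second-order approximation of KL(f(x) || f(x + eps)) for a small perturbation.
eps = 1e-3 * rng.standard_normal(5)
kl_exact = kl(softmax(W @ x), softmax(W @ (x + eps)))
kl_quad = 0.5 * eps @ G @ eps

# Worst-case direction under an L2 budget: the leading eigenvector of G_x.
eigvals, eigvecs = np.linalg.eigh(G)   # eigenvalues in ascending order
lam_max, v_max = eigvals[-1], eigvecs[:, -1]
radius = 1e-3
kl_worst = kl(softmax(W @ x), softmax(W @ (x + radius * v_max)))
kl_random = kl(softmax(W @ x),
               softmax(W @ (x + radius * eps / np.linalg.norm(eps))))
```

Here `kl_quad` tracks `kl_exact` closely, and a perturbation along `v_max` changes the output distribution more than a random perturbation of the same norm, with `kl_worst` close to `0.5 * lam_max * radius**2`.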
Given two data-label pairs and with , we could rewrite as:
[TABLE]
Therefore, we apply the same Taylor expansion as above and obtain:
[TABLE]
Comparing (10) and (7), we could further notice that they share the same Fisher information . Therefore, the adversarial behavior at each data point and performance objective CCKL could be connected by Fisher information at each data point.
Cramér-Rao bound
The adversarial training proposed in the paragraphs above constrains the input-output Fisher information of a DNN model. This constraint is a criterion of a good DNN model for the following reasons. Recall the well-known Cramér-Rao bound
[TABLE]
which states that if we use a statistic of the output probability to reconstruct the input , the uncertainty, in terms of variance, is bounded below by the inverse of the Fisher information . For a DNN model that represents reality, when it classifies an image with a correct label, say a dog, the label does not carry any information about the environment: what color the dog is, where the dog is, the adversarial perturbation, etc. Therefore, one cannot use the information contained in the label to reconstruct the original image. This means that the variance of any statistic derived from the output distribution is relatively large for a good DNN model. In view of the Cramér-Rao bound, this is consistent with the Fisher information of a good DNN being relatively small.
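For reference, the scalar form of the bound can be sketched as follows. The notation is assumed for illustration: $G_x$ stands for the Fisher information of the output distribution $f(x)$ w.r.t. the input $x$, and $T$ for any unbiased estimator of $x$ computed from a label $y$ sampled from $f(x)$.

```latex
\operatorname{Var}_{y \sim f(x)}\big[\,T(y)\,\big] \;\ge\; \frac{1}{G_x}
```

Read this way, a small $G_x$ forces a large variance on every such estimator, which matches the intuition that a label should reveal little about the particular image it came from.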
3.3 Disentanglement of the Performance Metric
In this section, we show that the proposed performance objective, a unifying measure of performance and robustness, can be decomposed into two components. To see this, we first denote in (10) as and the following terms as . Thus can be formulated as:
[TABLE]
Taking a closer look at (11), we notice that increases in and can both contribute to the rise of , which is the performance objective. Note that since is a second-order polynomial induced by , and is the fixed distance between the two inputs and , a rise in asymptotically requires a rise in the norm of . That is to say, if the model relies heavily on increasing to boost performance, the norm of has to increase drastically. However, according to (8) and our earlier derivation, a rise in the spectral norm of means more severe adversarial behavior around . The trade-off between standard performance and adversarial behavior is thus clearly characterized: the model can rely on to boost performance, but this comes with the side effect of more severe adversarial behavior. It should also be noted, however, that also contributes to the overall performance objective while not being involved in the adversarial objective. Therefore, relying on the terms in to distinguish from data belonging to other categories does not cause adversarial behavior. We have thus disentangled the non-robust component and the robust component in the overall performance objective.
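The argument above can be summarized in one line. The notation here is assumed for illustration, as a sketch rather than a restatement of (10) and (11): $\delta = x_j - x_i$ is the difference between two inputs of different categories, $G_{x_i}$ is the Fisher information at $x_i$, and $R$ collects all higher-order terms of the expansion.

```latex
D_{\mathrm{KL}}\big(f(x_i)\,\|\,f(x_j)\big)
  = \underbrace{\tfrac{1}{2}\,\delta^{\top} G_{x_i}\,\delta}_{\text{non-robust component}}
  + \underbrace{R(x_i,\delta)}_{\text{robust component}},
\qquad
\tfrac{1}{2}\,\delta^{\top} G_{x_i}\,\delta
  \;\le\; \tfrac{1}{2}\,\lambda_{\max}(G_{x_i})\,\|\delta\|_2^{2}.
```

Because $\|\delta\|_2$ is fixed by the data, the first term can only grow if $\lambda_{\max}(G_{x_i})$ grows, and that eigenvalue is the quantity governing the worst-case perturbation around $x_i$.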
To further understand the role of in classification, we visualize how the F-norm of the Fisher information evolves during training. Note that we visualize the F-norm instead of the spectral norm because all matrix norms on a finite-dimensional space are equivalent and the spectral norm is not computationally feasible in our case. We first show empirically how the average F-norm of the Fisher information on the test set and the standard test accuracy vary during natural training. The visualization is in Figure 3. We observe that the norm of the Fisher information increases drastically as accuracy rises, which indicates that current deep learning models rely heavily on the non-robust component to boost performance.
We then compare natural training with the two state-of-the-art adversarial training algorithms [17, 32] using the same visualization method. The result is shown in Figure 4. During adversarial training, although the average F-norm of the Fisher information also rises with standard accuracy, its values are significantly smaller than their counterparts during natural training. That is to say, the adversarial training process effectively constrains the model from using Fisher information to boost performance.
Our experiments also demonstrate the widely known but poorly understood fact that the standard accuracy of models under the two adversarial training algorithms is significantly lower than that of their counterparts under natural training. According to our theory, this is because they are unable to rely effectively on adversary-prone non-robust components such as Fisher information to distinguish the input data from data belonging to other categories. We thus demonstrate, theoretically and empirically, the relationship between standard accuracy and adversarial robustness using the proposed disentanglement.
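The norm-equivalence argument can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the procedure used for the figures: `p` and `grads` are random stand-ins for an output distribution and the per-class input gradients of the log-probabilities, and the Gram-matrix shortcut is one possible way to obtain the F-norm without forming the full matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, dim = 4, 50
p = rng.dirichlet(np.ones(num_classes))           # stand-in output distribution
grads = rng.standard_normal((num_classes, dim))   # stand-in rows: grad_x log p_i

# Explicit dim x dim matrix sum_i p_i g_i g_i^T (only feasible for small dim).
G = (grads.T * p) @ grads

spectral = np.linalg.norm(G, 2)        # largest singular value
frobenius = np.linalg.norm(G, "fro")

# Same Frobenius norm from the K x K Gram matrix, without forming G:
# ||G||_F^2 = sum_{i,j} p_i p_j (g_i . g_j)^2.
gram = grads @ grads.T
frobenius_gram = np.sqrt(np.sum(np.outer(p, p) * gram ** 2))

# rank(G) <= num_classes, so ||G||_2 <= ||G||_F <= sqrt(num_classes) * ||G||_2.
rank_bound = np.sqrt(num_classes)
```

Since the rank of this matrix is at most the number of classes, the F-norm and the spectral norm differ by at most a factor of `sqrt(num_classes)`, so trends in one are informative about the other.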
3.4 Explanation From Geometry Point of View
We provide some explanations of the above findings from the information geometry viewpoint. During training, the model approximately performs maximum likelihood estimation by learning to fit the label distribution over the training data. Training can thus be viewed as a process in which the log-likelihood landscape of the model on the training data gradually transitions into a state where the model distinguishes data of different categories well. On the other hand, the training data is very sparsely sampled from the whole distribution. Therefore, during the formation of the model’s log-likelihood landscape, the smoothness prior does not hold. Without this property, the model can easily use a lower-order local geometric descriptor such as Fisher information (the local curvature of the log-likelihood landscape) to form an overly simplified log-likelihood landscape that is adversary-prone due to its lack of smoothness. When adversarial training is applied, a strong smoothness property is enforced and the model has to rely on higher-order global geometric descriptors that vanish locally to form the whole landscape. The landscape can therefore be more robust to adversarial samples.
4 Towards Simultaneous Good Performance and Robustness
With the disentanglement introduced above, it is natural to ask whether decent standard accuracy and adversarial robustness can be achieved simultaneously. According to our results, if relying on the robust component alone can effectively distinguish data of different categories, then obtaining an adversarially robust model with high standard accuracy is possible.
On the other hand, the expansion terms in the robust component are all higher-order terms. Therefore, we suggest that the key to achieving the two desired qualities simultaneously is to increase the expressive power of the model so that it can use the higher-order terms for prediction. In this way, the model does not have to rely heavily on Fisher information while still attaining decent standard performance through the higher-order terms (the robust component). Our disentanglement provides theoretical justification for the importance of model complexity in achieving adversarial robustness together with relatively decent standard performance.
We also designed experiments on CIFAR-10 to provide insight into this possible strategy for achieving the two objectives simultaneously. We train the following models with the TRADES algorithm [32]: a pure linear VGG13 model without ReLU and with average pooling, a half linear VGG13 model without ReLU but with max pooling, a normal VGG13 model, a normal resnet20 model, and a resnet32 model with 10 times more channels (resnet32x10). We evaluate these models on standard samples and on adversarial samples produced by the PGD attack [17] and the CW attack [3]. The results and experimental details are provided in Table 1.
We first compare the VGG models. The pure linear VGG13 model is not complex enough to exploit higher-order information for decision making during adversarial training. At the same time, it cannot effectively leverage the lower-order terms for prediction, so it achieves very poor accuracy on both standard and adversarial samples. However, as the non-linearity of the model increases (the half linear and normal VGG13 models), the performance on standard and adversarial samples improves simultaneously. For the resnet models, when the model is shallower and narrower (resnet20x1), the performance in both the standard and adversarial settings is relatively low. With a deeper and wider resnet model (resnet32x10), the model can exploit more higher-order information for prediction, so the performance on both standard and adversarial samples increases significantly.
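To see why the pure linear model is so limited, recall that a stack of linear layers without activations collapses into a single linear map, whatever its depth. The sketch below illustrates this with two hypothetical dense layers standing in for the convolution and average pooling layers (both of which are linear operations); biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 16))   # hypothetical first layer
W2 = rng.standard_normal((4, 8))    # hypothetical second layer
x = rng.standard_normal(16)

# Two stacked linear layers without an activation in between...
stacked = W2 @ (W1 @ x)
# ...are the same function as one linear layer with the product matrix.
collapsed = (W2 @ W1) @ x

def relu(z):
    return np.maximum(z, 0.0)

def nonlinear_net(v):
    # Same two layers with a ReLU in between.
    return W2 @ relu(W1 @ v)

# A linear map satisfies f(-x) = -f(x); the ReLU network does not.
odd_symmetry_holds = np.allclose(nonlinear_net(-x), -nonlinear_net(x))
```

Adding depth to the pure linear model therefore adds no expressive power to its logits, whereas inserting non-linearities, as in the half linear and normal models, does.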
5 Discussion
When is NOT sufficiently small
Empirically, researchers have found that the local linearity assumption does not hold when the allowed norm of the attack noise is relatively large [13, 17]. In this case, our analysis is less precise, since higher-order terms might contribute substantially in equation (10). However, a higher-order analogue of (10) and of Fisher information could be available and could explain the generalization behavior and adversarial robustness. For more details on this issue, we refer the reader to Appendix B.
Future directions
As discussed above, our work has a strong geometric flavor. We therefore suggest that analyzing the geometric properties of the log-likelihood landscape formed by DNNs on input data could provide even more interesting insights. Also, some recent works [30, 16, 5, 31] propose to view the deep learning model as a non-linear dynamic system and study it from the control point of view. The reachable state-space and the stability of a dynamic system are both widely studied in control theory, and we think they correspond to the performance and the robustness to input perturbation of a deep learning system. Moreover, when the reachable state-space of the dynamic system is large, the system might be highly sensitive in certain directions of the state-space, which could also lead to chaotic behavior. This kind of trade-off is very similar to the one discussed in our work. Therefore, we think it might also be possible to characterize the relationship between the standard accuracy and the adversarial behavior of deep learning models from the dynamic system point of view.
6 Conclusion
In this work, we provide a novel viewpoint on the standard accuracy and adversarial robustness of deep learning models and show that the overall performance objective can be disentangled into a non-robust component, which is adversary-prone, and a robust component, which is unrelated to adversarial behavior. In this way, we theoretically explain the relationship between standard accuracy and adversarial robustness: the cost of being adversarially robust is that the model can no longer effectively rely on the non-robust component to distinguish input data from different categories, which means that there is indeed a trade-off between these two objectives. However, these two objectives might not be completely contradictory, as there might exist a balance point where the robust component of the model distinguishes input data from different categories well while the non-robust component is not large enough to cause severe adversarial behavior. We also discussed the scenario where the norm of the allowed perturbation is not sufficiently small and higher-order terms are needed in the expansion of the adversarial objective, and showed that our theory still holds. We are confident that more interesting theory about standard accuracy and adversarial robustness can be developed in the future based on our theory.
Appendix A More Experiments on the Role of Fisher Information
We conduct more visualization experiments about the role of Fisher information in standard performance of DNN. The results are shown in Figure 5 and Figure 6. The experiments are conducted on a resnet20 model. The same conclusion could be drawn according to our statistics.
Appendix B Adversarial behavior in large perturbation region
We now further discuss the scenario where the norm of the adversarial noise is so large that the Fisher information alone is not enough to characterize the adversarial behavior of the model. However, since is still a small value, the expansion terms should still vanish at a certain order , where is an integer related to . Therefore, for two data-label pairs and , where , the adversarial objective around should be rewritten as:
[TABLE]
We then rewrite the performance objective for better view:
[TABLE]
Here, we see that the adversarial behavior of the model around is no longer related solely to the spectral norm of ; it is also related to the norms of the other multi-linear functionals that yield the other terms. Similar to the earlier derivation, under the attack scale of , if the model relies heavily on the first to distinguish input data from data belonging to other categories, then the norm of the first multi-linear functional in the expansion has to increase drastically, leading to more severe adversarial behavior around according to (12).
Therefore, we show that even under the scenario where is not sufficiently small, the variant of our disentanglement could still clearly explain the relationship between achieving high standard accuracy and staying adversarial robust.
References
- [1] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. arXiv preprint arXiv:1902.03545, 2019.
- [2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- [3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
- [4] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
- [5] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
- [6] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
- [7] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
