Rethinking Model Ensemble in Transfer-based Adversarial Attacks
Huanran Chen, Yichi Zhang, Yinpeng Dong, Xiao Yang, Hang Su, Jun Zhu

TL;DR
This paper analyzes why ensemble methods improve transfer-based adversarial attacks and introduces a new attack strategy, CWA, that enhances transferability by targeting model weaknesses related to loss landscape flatness and local optima proximity.
Contribution
It provides a theoretical and empirical analysis of ensemble weaknesses and proposes the Common Weakness Attack (CWA) to generate more transferable adversarial examples.
Findings
CWA improves transferability on image classification tasks.
CWA enhances attack success on object detection models.
CWA is effective against adversarially trained models and real-world systems such as Google's Bard.
Abstract
It is widely recognized that deep learning models lack robustness to adversarial examples. An intriguing property of adversarial examples is that they can transfer across different models, which enables black-box attacks without any knowledge of the victim model. An effective strategy to improve the transferability is attacking an ensemble of models. However, previous works simply average the outputs of different models, lacking an in-depth analysis on how and why model ensemble methods can strongly improve the transferability. In this paper, we rethink the ensemble in adversarial attacks and define the common weakness of model ensemble with two properties: 1) the flatness of loss landscape; and 2) the closeness to the local optimum of each model. We empirically and theoretically show that both properties are strongly correlated with the transferability and propose a Common Weakness…
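The abstract notes that earlier ensemble attacks simply average over the surrogate models. The following is a minimal sketch of that baseline (not CWA itself), assuming toy linear softmax classifiers so that gradients can be written analytically. The function names, the step size, and the budget `eps` are illustrative choices, not taken from the paper. Here the per-model cross-entropy losses are averaged; averaging logits would be another reading of "averaging the outputs".

```python
import numpy as np

def ce_loss_and_grad(W, x, y):
    """Cross-entropy of a toy linear softmax model (logits = W @ x) and its gradient w.r.t. x."""
    z = W @ x
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())   # numerically stable log-sum-exp
    p = np.exp(z - lse)                     # softmax probabilities
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return lse - z[y], W.T @ (p - onehot)

def ensemble_attack(models, x_nat, y, eps=0.5, step=0.1, iters=20):
    """Iterative sign-gradient ascent on the loss averaged over all surrogate models."""
    x = x_nat.copy()
    for _ in range(iters):
        # Average the input gradients of all surrogates (same as the gradient of the mean loss).
        g = np.mean([ce_loss_and_grad(W, x, y)[1] for W in models], axis=0)
        x = x + step * np.sign(g)
        # Project back into the L-infinity ball of radius eps around the clean input.
        x = np.clip(x, x_nat - eps, x_nat + eps)
    return x

# Hypothetical toy setup: 4 surrogate models, 3 classes, 5 input features.
rng = np.random.default_rng(0)
models = [rng.normal(size=(3, 5)) for _ in range(4)]
x_nat = rng.normal(size=5)
y = 0
x_adv = ensemble_attack(models, x_nat, y)
```

Transferability would then be judged by evaluating `x_adv` on a held-out model that was not part of `models`; this sketch only shows the white-box averaging step.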
Peer Reviews
Decision: ICLR 2024 poster
1. The paper is well-written in general and has very clear motivation and mathematical formulations. The derivation of the attack based on the second-order approximation is intuitive. 2. The authors conduct very extensive empirical studies with many choices of architectures and datasets.
1. In Table 1, the authors only provide the results of CWA in combination with other methods. Are there results for CWA alone, and how well does it perform? 2. As an ablation study, the authors might want to try a different norm decomposition, e.g. the operator norm of H. 3. The authors should clarify the novelty compared to previous methods. Specifically, MI, VMI, and SSA are all existing attacks, and sharpness-aware minimization techniques have been used previously.
- The paper tackles an important subject: understanding how vulnerable ML models can be is crucial for safeguarding them against possible adversaries. - The proposed method, as far as I can tell, seems novel. - The strengths of this paper are primarily in the effectiveness of the proposed method, as it seems to combine very well with prior attacks (e.g., MI-CWA, SSA-CWA) and achieves superior results. The transfer-based black-box attack against Bard is interesting. - The proposed algo
- I felt that the paper was a bit hard to follow. Section 3 mixes many prior results with the proposed ones, which makes it hard to distinguish original contributions from reused prior results. For instance, it would greatly improve readability if SAM were properly explained before this section (or as a subsection). Figures 1 and 2 are not very informative (see Questions); they lack axes and/or labels. It would also improve the quality of the paper if the authors include a mathematical
1. This paper conducts extensive experiments on several datasets, demonstrating that CWA achieves better results than previous methods. 2. The intuition of common weakness is strong and clear. This paper converts the task "crafting adversarial examples on ensemble models" to "optimizing the second term in Equation (2)". By Theorem 3.1, this term is further decomposed into a "flatness term" and a "closeness term". These two terms can be efficiently optimized by SAM and CSE.
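The decomposition this review refers to rests on a standard inequality. As a sketch, assuming notation not shown in this summary ($H_i$ for the Hessian of model $i$'s loss at its local optimum $p_i$, and $x$ for the adversarial example), the second-order term of each model can be bounded as

$$
(x - p_i)^\top H_i \,(x - p_i) \;\le\; \|H_i\|_2 \,\|x - p_i\|_2^2 \;\le\; \|H_i\|_F \,\|x - p_i\|_2^2 .
$$

The factor $\|H_i\|_F$ measures how sharp the loss landscape is (the flatness term), and $\|x - p_i\|_2^2$ measures the distance to the local optimum (the closeness term). The exact statement and assumptions of Theorem 3.1 are given in the paper and may differ from this per-model form.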
1. The intuition behind SAM and CSE is not clear enough. Equations (4) and (5) are confusing for readers unfamiliar with this area. 2. In Section 3.1, the authors state that "the goal of transfer-based attacks is to craft an adversarial example $x$ that is misclassified by all models in $\mathcal{F}$". However, fooling all target models seems impossible in practice. Besides, the experiments in this paper do not support this claim. 3. The perturbation in Figure 3 is not imp
Taxonomy
Topics: Adversarial Robustness in Machine Learning
