TL;DR
This paper introduces AdvFLYP, a simple adversarial finetuning paradigm for vision-language models such as CLIP. It improves zero-shot adversarial robustness by aligning adversarial images with their paired texts and adding logit- and feature-level regularization, and it outperforms existing methods.
Contribution
AdvFLYP leverages CLIP's pretraining process for adversarial finetuning on web-collected image-text pairs, enhancing robustness and transferability across diverse datasets.
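As a rough illustration of what "following CLIP's pretraining recipe" means for the objective, the sketch below computes a symmetric image-text contrastive loss over a batch of embeddings. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the temperature of 0.07 is an assumed value, and it assumes the image embeddings would come from adversarially perturbed images (how those are generated is not described here). The regularization terms are not included.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over a batch where
    # image_embs[k] and text_embs[k] form the k-th matched pair.
    # In adversarial finetuning, image_embs would be the embeddings of
    # adversarial images (assumption; the attack itself is not shown).
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = cosine similarity of image r and text c, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick its own column.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick its own row.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: two pairs whose embeddings match, then the same images with swapped order.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_contrastive_loss(aligned_images, texts)
loss_swapped = clip_contrastive_loss(swapped_images, texts)
```

In this toy batch the aligned loss is close to zero and the swapped loss is large, which is the signal that pulls each adversarial image embedding toward its own caption and away from the other captions in the batch.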
Findings
AdvFLYP outperforms mainstream practices on 14 downstream datasets.
Logit- and feature-level regularizations improve robustness and clean accuracy.
Regularization stabilizes adversarial image embeddings of noisy web images.
Abstract
Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance its adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning to the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and match…
