Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein

TL;DR
This paper introduces an unsupervised adversarial fine-tuning method that makes CLIP vision encoders robust, substantially improving the resistance of large vision-language models to adversarial attacks without retraining the downstream models.
Contribution
The paper presents a novel unsupervised adversarial fine-tuning approach that makes CLIP vision encoders robust to adversarial attacks; the resulting encoder can be used across the downstream tasks that rely on CLIP.
Findings
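Robust CLIP vision encoders resist adversarial attacks on the vision modality.
Stealth attacks on LVLM users through manipulated images become infeasible once the robust CLIP encoder is used.
Downstream vision-language models need no retraining to benefit from the robust encoder.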
Abstract
Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images…
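The abstract does not spell out the training objective, so the following is only a toy sketch of one way an unsupervised adversarial embedding loss can be set up: an attacker searches an L-infinity ball around a clean input for the perturbation that moves the embedding furthest from the clean embedding, with no labels involved. Everything here is an assumption made for illustration: the linear `encode` function stands in for a real vision encoder, the radius `eps = 4/255` and step size are arbitrary choices, and the helper names are hypothetical. Clipping to the valid pixel range and the outer fine-tuning step that would minimize this loss are omitted.

```python
import random

def encode(weights, x):
    """Toy linear stand-in for a vision encoder: returns weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def embedding_loss(weights, x_adv, target):
    """Squared L2 distance between the embedding of x_adv and a fixed target."""
    emb = encode(weights, x_adv)
    return sum((e - t) ** 2 for e, t in zip(emb, target))

def loss_grad_wrt_input(weights, x_adv, target):
    """Analytic input gradient for the linear toy: 2 * W^T (W x_adv - target)."""
    residual = [e - t for e, t in zip(encode(weights, x_adv), target)]
    return [
        2.0 * sum(weights[k][j] * residual[k] for k in range(len(weights)))
        for j in range(len(x_adv))
    ]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(weights, x, target, eps, step, n_steps, rng):
    """Sign-gradient ascent on the embedding loss inside an L-inf ball around x."""
    # Random start: at delta = 0 the loss and its gradient are both zero here.
    delta = [rng.uniform(-eps, eps) for _ in x]
    best_delta = list(delta)
    best_loss = embedding_loss(weights, [xi + d for xi, d in zip(x, delta)], target)
    for _ in range(n_steps):
        x_adv = [xi + d for xi, d in zip(x, delta)]
        grad = loss_grad_wrt_input(weights, x_adv, target)
        # Ascent step followed by projection back onto the eps-ball.
        delta = [max(-eps, min(eps, d + step * sign(g))) for d, g in zip(delta, grad)]
        loss = embedding_loss(weights, [xi + d for xi, d in zip(x, delta)], target)
        if loss > best_loss:
            best_loss, best_delta = loss, list(delta)
    return best_delta, best_loss

rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # toy encoder weights
x = [rng.random() for _ in range(4)]                             # toy "image"
clean_target = encode(W, x)   # embedding of the clean input, used as the label-free target
eps = 4 / 255                 # illustrative perturbation radius (assumption)

delta, adv_loss = pgd_linf(W, x, clean_target, eps, step=eps / 4, n_steps=10, rng=rng)
clean_loss = embedding_loss(W, x, clean_target)
print(f"clean loss: {clean_loss}, adversarial loss: {adv_loss:.6g}")
```

In a fine-tuning loop of this kind, the adversarial loss returned by the attack would be the quantity minimized with respect to the encoder weights, while the clean target embeddings stay fixed.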
Code & Models
- 🤗 chs20/tecoa4-clip (model) · 835 dl · ♡ 1
- 🤗 chs20/fare4-clip (model) · 1.6k dl · ♡ 1
- 🤗 chs20/fare2-clip (model) · 871 dl · ♡ 2
- 🤗 chs20/tecoa2-clip (model) · 814 dl · ♡ 1
- 🤗 chs20/FARE4-ViT-B-16-laion2B-s34B-b88K (model) · 12 dl
- 🤗 chs20/FARE4-ViT-B-32-laion2B-s34B-b79K (model) · 17 dl
- 🤗 chs20/FARE4-convnext_base_w-laion2B-s13B-b82K-augreg (model) · 13 dl
- 🤗 chs20/dino-vitb16-fare4 (model) · 1 dl
- 🤗 chs20/dinov2-base-fare4 (model) · 3 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection
