Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Christian Schlarmann; Naman Deep Singh; Francesco Croce; Matthias Hein

arXiv:2402.12336 · cs.LG · June 6, 2024 · 6 citations

TL;DR

This paper introduces an unsupervised adversarial fine-tuning method that makes CLIP vision encoders robust, improving the resistance of large vision-language models to attacks on the vision modality without retraining the downstream models.

Contribution

It presents an unsupervised adversarial fine-tuning approach that makes the CLIP vision encoder robust to adversarial attacks; the robustness carries over to the downstream tasks that rely on CLIP (LVLMs, zero-shot classification).

Findings

01 Robust CLIP vision encoders resist adversarial attacks on the vision modality.

02 Stealth attacks on LVLM users become infeasible once the robust CLIP encoder is used.

03 Downstream vision-language models need no retraining.

Abstract

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images…
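
The abstract does not state the training objective. As a sketch only, one common way to formalize unsupervised adversarial fine-tuning of an encoder is to keep the embedding of a perturbed image close to the embedding that the original, frozen encoder assigns to the clean image. The symbols below ($\phi_{\theta}$ for the encoder being fine-tuned, $\phi_{\mathrm{Org}}$ for the original encoder, the radius $\varepsilon$, the $\ell_\infty$ threat model, and the squared $\ell_2$ distance) are assumptions made for illustration, not details taken from this page.

```latex
% Assumed notation (not from the page):
%   x_i             clean, unlabeled training images, i = 1..n
%   z               perturbed image inside an l_inf ball of radius epsilon
%   \phi_\theta     vision encoder being fine-tuned (parameters \theta)
%   \phi_{Org}      original CLIP vision encoder, kept frozen
\[
\min_{\theta} \; \sum_{i=1}^{n} \;
\max_{\|z - x_i\|_\infty \le \varepsilon} \;
\bigl\| \phi_{\theta}(z) - \phi_{\mathrm{Org}}(x_i) \bigr\|_2^2
\]
```

No labels or captions appear in an objective of this form, which is what would make it unsupervised. Because the targets are the original encoder's embeddings, a downstream model built on those embeddings could in principle use the fine-tuned encoder as a drop-in replacement, which is consistent with the claim that no downstream retraining is needed.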

Code & Models

Repositories

chs20/robustvlm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection