Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Elias Abad Rocamora; Christian Schlarmann; Naman Deep Singh; Yongtao Wu; Matthias Hein; Volkan Cevher

arXiv:2506.03355 · cs.LG · October 13, 2025

TL;DR

This paper introduces LEAF, an adversarial finetuning method that enhances the robustness of CLIP's text encoder, improving performance in adversarial settings without sacrificing vision capabilities.

Contribution

LEAF is a scalable adversarial finetuning approach that significantly improves CLIP's text encoder robustness across multiple tasks while maintaining existing vision model performance.

Findings

01 Improved zero-shot adversarial accuracy in the text domain

02 Enhanced multimodal retrieval recall under adversarial noise

03 Better reconstruction of input text from embeddings

Abstract

Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain that scales to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, our models improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under…
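To make the notion of an embedding shift under a text attack concrete, the following is a minimal sketch and not the paper's method. It assumes a single-character substitution as the perturbation (this page does not specify the threat model) and uses a hypothetical toy encoder, `toy_text_encoder`, in place of a real CLIP text encoder. It brute-forces the substitution that moves the embedding furthest from the clean one, measured by cosine similarity.

```python
import hashlib
import math

DIM = 64  # embedding size of the toy encoder (arbitrary choice)


def toy_text_encoder(text: str) -> list[float]:
    """Hypothetical stand-in for a text encoder (NOT CLIP).

    Counts hashed character trigrams and L2-normalises the result.
    """
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "  # padding guarantees at least two trigrams
    for i in range(len(padded) - 2):
        digest = hashlib.sha256(padded[i:i + 3].encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-norm vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def worst_single_substitution(text: str,
                              alphabet: str = "abcdefghijklmnopqrstuvwxyz"):
    """Try every one-character substitution and return the one whose
    embedding is least similar to the clean embedding (assumed threat model)."""
    clean = toy_text_encoder(text)
    best_text, best_sim = text, 1.0
    for pos in range(len(text)):
        for ch in alphabet:
            if ch == text[pos]:
                continue
            candidate = text[:pos] + ch + text[pos + 1:]
            sim = cosine(clean, toy_text_encoder(candidate))
            if sim < best_sim:
                best_text, best_sim = candidate, sim
    return best_text, best_sim


if __name__ == "__main__":
    adv_text, sim = worst_single_substitution("a photo of a dog")
    print(f"perturbed caption: {adv_text!r}, cosine to clean: {sim:.3f}")
```

With a real encoder, the same loop would call the model's text tower instead of `toy_text_encoder`. Exhaustive search is used here only because it is easy to read, not because it is efficient.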

Videos

Robustness in Both Domains: CLIP Needs a Robust Text Encoder · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion · Contrastive Language-Image Pre-training