Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

Lu Yu; Haiyang Zhang; Changsheng Xu

arXiv:2410.21802·cs.CV·October 31, 2024

TL;DR

This paper introduces TGA-ZSR, a framework that improves the adversarial robustness of vision-language models such as CLIP by aligning and constraining text-guided attention, yielding a reported 9.58% improvement in zero-shot robust accuracy, validated across 16 datasets.

Contribution

The paper proposes TGA-ZSR, an attention-based method that enhances the zero-shot robustness of CLIP models by aligning the model's text-guided attention with, and constraining it toward, the attention obtained on the original clean examples.

Findings

1. Achieves a 9.58% improvement in zero-shot robust accuracy.
2. Validates effectiveness across 16 datasets.
3. Introduces the Attention Refinement and Attention-based Model Constraint modules.

Abstract

Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed that adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module. Our goal is to maintain the generalization of the CLIP model while enhancing its adversarial robustness. The Attention Refinement module aligns the text-guided attention obtained from the target model via adversarial examples with the text-guided attention…
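
The attention-alignment idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the official repository for that). It assumes, purely for illustration, that a text-guided attention map is the cosine similarity between each image patch feature and a text embedding, min-max normalised, and that two maps are compared with an L2 distance. The function names `text_guided_attention` and `attention_alignment_loss`, the toy feature vectors, and the choice of a clean-example map as the reference are all hypothetical.

```python
import math

def text_guided_attention(patch_feats, text_emb):
    """Score each image patch by cosine similarity to a text embedding,
    then min-max normalise the scores into an attention map in [0, 1].
    (Assumed formulation for illustration only.)"""
    def unit(v):
        # Scale a vector to unit length; leave an all-zero vector unchanged.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    t = unit(text_emb)
    scores = [sum(a * b for a, b in zip(unit(p), t)) for p in patch_feats]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat map
    return [(s - lo) / span for s in scores]

def attention_alignment_loss(attn_a, attn_b):
    """L2 distance between two attention maps of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(attn_a, attn_b)))

# Toy example: three patches with 2-D features and one text embedding.
text_emb = [1.0, 0.0]
clean_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adv_patches = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # attention has shifted

clean_attn = text_guided_attention(clean_patches, text_emb)
adv_attn = text_guided_attention(adv_patches, text_emb)

# A training objective in this spirit would penalise this distance.
shift = attention_alignment_loss(adv_attn, clean_attn)
```

In this toy setting the perturbed patches swap which region matches the text, so `shift` is large; an alignment term of this kind would push the adversarial attention map back toward the reference map.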

Code & Models

Repositories

zhyblue424/tga-zsr (PyTorch, official)

Videos

Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training