Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Qiwei Tian; Chenhao Lin; Zhengyu Zhao; Chao Shen

arXiv:2512.07222 · cs.LG · April 17, 2026

TL;DR

This paper introduces Function-word De-Attention (FDA), a novel method that reduces the vulnerability of vision-language models to adversarial attacks by de-emphasizing function words in cross-modal attention.

Contribution

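The paper proposes FDA, an attention mechanism that subtracts function-word cross-attention from the original cross-attention within attention heads, improving VLM robustness to cross-modal adversarial attacks at little cost in performance. It is evaluated against 2 SOTA baselines under 6 attacks on 2 downstream tasks, 3 datasets, and 3 models.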

Findings

01

FDA lowers attack success rate (ASR) by an average of 18/13/53% on the 3 tested models for retrieval, and by 90% on visual grounding.

02

FDA costs only 0.2/0.3/0.6% in performance on the 3 tested models for retrieval, and gains 0.3% on visual grounding.

03

FDA demonstrates scalability, generalization, and zero-shot robustness.

Abstract

To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and we propose Function-word De-Attention (FDA) to mitigate their impact. Similar to a differential amplifier, FDA computes both the original cross-attention and the function-word cross-attention within attention heads, then subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot…
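The differential subtraction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fda_cross_attention`, the boolean `function_mask`, and the scaling coefficient `lam` are all hypothetical, and the sketch assumes the function-word attention is simply the original attention map restricted to function-word tokens. The paper may define that branch differently.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fda_cross_attention(q, k, v, function_mask, lam=1.0):
    """Single-head cross-attention with function-word de-attention (sketch).

    q:             (n_img, d) image-side queries
    k, v:          (n_txt, d) text-side keys and values
    function_mask: (n_txt,) bool, True where the text token is a function word
    lam:           hypothetical subtraction strength (assumption, not from the paper)
    """
    d = q.shape[-1]
    # Original scaled dot-product cross-attention weights.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    # Assumed function-word branch: the same weights, kept only on function words.
    fw_attn = attn * function_mask[None, :]
    # Differential step: subtract the function-word attention from the original.
    diff = attn - lam * fw_attn
    return diff @ v, diff


# Illustrative usage with random features for the caption "a dog on the grass".
tokens = ["a", "dog", "on", "the", "grass"]
function_mask = np.array([True, False, True, True, False])
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))            # 3 image tokens, feature size 4
k = rng.normal(size=(len(tokens), 4))
v = rng.normal(size=(len(tokens), 4))
out, weights = fda_cross_attention(q, k, v, function_mask, lam=1.0)
```

Under these assumptions, `lam=0.0` recovers ordinary cross-attention, while `lam=1.0` removes all weight from the function-word tokens and leaves the content-word weights unchanged.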

Code & Models

Repositories

michaeltian108/FDA
github

Videos

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models · slideslive