Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
Futa Waseda, Saku Sugawara, Isao Echizen

TL;DR
This paper introduces QT-AFT, a novel adversarial fine-tuning method that uses high-quality captions to improve the robustness of vision-language models against adversarial attacks, outperforming prior approaches.
Contribution
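The paper proposes QT-AFT, which uses high-quality captions during adversarial fine-tuning to improve visual robustness. It targets two limitations of existing adversarial training methods: supervised methods overfit to the object classes in the training data because they rely on short texts such as class labels, and unsupervised methods lack semantic guidance against text-guided attacks.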
Findings
QT-AFT achieves state-of-the-art robustness on 16 datasets.
Describing object properties alongside names improves zero-shot robustness.
Language guidance significantly enhances visual adversarial robustness.
Abstract
Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics…
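To make the idea of a text-guided adversarial perturbation concrete, here is a minimal sketch and not the authors' implementation. The summary above does not give the exact QT-AFT objective, so the sketch assumes a plain L-infinity PGD attack that lowers the cosine similarity between an image embedding and a caption embedding. The random linear map `W` is a hypothetical stand-in for a CLIP-style image encoder, `t` is a stand-in caption embedding, and the values of `eps`, `alpha`, and `steps` are illustrative choices.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad_wrt_x(W, x, t):
    """Gradient of cos(W @ x, t) with respect to x for a linear encoder W."""
    z = W @ x
    nz, nt = np.linalg.norm(z), np.linalg.norm(t)
    # d cos / d z, then chain rule through z = W @ x
    dz = t / (nz * nt) - (z @ t) * z / (nz ** 3 * nt)
    return W.T @ dz


def text_guided_pgd(W, x, t, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-inf PGD that pushes the image embedding away from the caption embedding.

    Assumption: the attack objective is to minimise image-caption cosine
    similarity; the paper's actual loss may differ.
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = cosine_grad_wrt_x(W, x + delta, t)
        # Signed-gradient step that decreases similarity, projected to the eps-ball
        delta = np.clip(delta - alpha * np.sign(g), -eps, eps)
        # Keep the perturbed image inside the valid pixel range [0, 1]
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))           # hypothetical toy image encoder
x = rng.uniform(0.2, 0.8, size=64)      # toy "image" with pixels in [0, 1]
t = W @ x + 2.0 * rng.normal(size=16)   # toy caption embedding near the image embedding

x_adv = text_guided_pgd(W, x, t)
clean_sim = cosine(W @ x, t)
adv_sim = cosine(W @ x_adv, t)
print(f"clean similarity: {clean_sim:.3f}, adversarial similarity: {adv_sim:.3f}")
```

In an adversarial fine-tuning loop, examples such as `x_adv` would then be fed back into training so the encoder learns to keep its embedding stable under the perturbation. That training step is omitted here because the summary does not specify it.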
Taxonomy
Topics: Safety Warnings and Signage
