Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Yuanbo Li; Tianyang Xu; Cong Hu; Tao Zhou; Xiao-Jun Wu; Josef Kittler

arXiv:2603.04839·cs.CV·March 24, 2026

TL;DR

This paper introduces SADCA, a novel attack method that enhances the transferability of adversarial examples in vision-language models by using semantic augmentation and dynamic contrastive interactions, leading to more effective cross-model attacks.

Contribution

The paper proposes SADCA, a semantic-augmented dynamic contrastive attack that improves transferability of adversarial examples in vision-language pre-training models through progressive, semantically guided perturbations.

Findings

1. SADCA outperforms existing attack methods in transferability across multiple datasets and models.

2. Semantic augmentation increases diversity and generalization of adversarial examples.

3. Dynamic contrastive interactions reinforce semantic inconsistency, enhancing attack effectiveness.

Abstract

With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to transfer, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. This is accomplished by SADCA…
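To make the general idea in the abstract concrete (an iterative, bounded perturbation that weakens an image's match with its own caption while strengthening its match with an unrelated one), here is a minimal toy sketch. It is not SADCA itself: the linear "encoder" `W`, the helper names, the use of a single negative text, and the parameters `eps`, `step`, and `iters` are all assumptions made for illustration. Semantic augmentation and the dynamic image-text interaction mentioned in the abstract are left out because the truncated abstract does not specify them.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def encode(W, x):
    """Toy linear image encoder (an assumption): one embedding coordinate per row of W."""
    return [dot(row, x) for row in W]


def contrastive_loss(W, x, t_pos, t_neg):
    """Attacker's objective: similarity to the negative text minus similarity to the positive text."""
    z = encode(W, x)
    return dot(z, t_neg) - dot(z, t_pos)


def sign(g):
    return (g > 0) - (g < 0)


def attack(W, x_clean, t_pos, t_neg, eps=0.1, step=0.02, iters=10):
    """Iterative sign-gradient ascent on contrastive_loss inside an L-infinity ball of radius eps."""
    # For a linear encoder the gradient w.r.t. x is constant: W^T (t_neg - t_pos).
    diff = [n - p for n, p in zip(t_neg, t_pos)]
    grad = [sum(W[i][j] * diff[i] for i in range(len(W))) for j in range(len(x_clean))]
    x = list(x_clean)
    for _ in range(iters):
        # Ascent step along the gradient sign.
        x = [xi + step * sign(g) for xi, g in zip(x, grad)]
        # Project back into the eps-ball around the clean input.
        x = [min(max(xi, c - eps), c + eps) for xi, c in zip(x, x_clean)]
    return x
```

With a real VLP model the gradient would come from automatic differentiation rather than a closed form, and the perturbed image would also be clipped to the valid pixel range; both are omitted here to keep the sketch dependency-free.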

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis