HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

Han Liu, Jiaqi Li, Zhi Xu, Xiaotong Zhang, Xiaoming Xu, Fenglong Ma, Yuanman Li, and Hong Yu

arXiv:2604.16499·cs.CV·April 21, 2026

TL;DR

HQA-VLAttack introduces a novel framework for high-quality black-box adversarial attacks on vision-language models, effectively generating adversarial examples by combining semantic-preserving text perturbations and contrastive learning-based image modifications.

Contribution

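The paper proposes a simple, effective two-stage attack (text, then image) that improves success rates by enforcing semantic consistency in text perturbations and using contrastive learning for image perturbations. It targets two limitations of prior methods: iterative cross-search strategies that consume many queries, and objectives that reduce the similarity of positive image-text pairs while ignoring negative pairs.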

Findings

01 Outperforms strong baselines in attack success rate on benchmark datasets.

02 Uses contrastive learning to optimize image adversarial perturbations.

03 Ensures semantic consistency in text perturbations using counter-fitting word vectors.
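
As a rough illustration of the text-side finding, the sketch below filters substitute words by cosine similarity in a word-vector space, which is one common way to keep a perturbed sentence close in meaning to the original. It is a minimal sketch under stated assumptions, not the paper's procedure: the function names, the threshold, and the three-dimensional toy vectors are all hypothetical, and real counter-fitted vectors would be loaded from a pre-trained file.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_substitutes(word, vectors, threshold=0.8, top_k=5):
    """Return up to top_k words whose similarity to `word` is >= threshold.

    `vectors` maps words to embeddings. The threshold and top_k values
    are illustrative assumptions, not values taken from the paper.
    """
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Keep only close neighbours, most similar first.
    kept = [(w, s) for w, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in kept[:top_k]]


# Hypothetical toy embeddings standing in for counter-fitted vectors.
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.0],
    "puppy": [0.9, 0.1, 0.0],
    "hound": [0.8, 0.3, 0.0],
    "cat": [0.0, 1.0, 0.0],
}

print(candidate_substitutes("dog", TOY_VECTORS))
```

A stricter threshold shrinks the candidate set, trading attack flexibility for closer semantic agreement with the original word.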

Abstract

Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation…
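
The abstract's point about positive and negative image-text pairs can be illustrated with a standard InfoNCE-style contrastive loss. The abstract is truncated here and does not give the paper's actual objective, so the sketch below is an assumption: `info_nce`, the temperature, and the two-dimensional toy embeddings are hypothetical. An attacker would aim to increase this loss, which lowers similarity to the paired text relative to non-paired texts rather than lowering the positive similarity alone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_emb, pos_text_emb, neg_text_embs, temperature=0.1):
    """InfoNCE loss of an image against one positive and several negative texts.

    Low loss: the image matches its paired text far better than the others.
    High loss: the image matches some non-paired text at least as well.
    """
    pos = cosine(image_emb, pos_text_emb) / temperature
    negs = [cosine(image_emb, t) / temperature for t in neg_text_embs]
    logits = [pos] + negs
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - pos


# Hypothetical embeddings: the clean image aligns with its paired text,
# the perturbed image has drifted toward a non-paired text.
paired_text = [1.0, 0.0]
other_texts = [[0.0, 1.0]]
clean_image = [1.0, 0.0]
perturbed_image = [0.2, 1.0]

print(info_nce(clean_image, paired_text, other_texts))
print(info_nce(perturbed_image, paired_text, other_texts))
```

In a real attack the embeddings would come from an encoder and the perturbation would be optimized under a norm budget; both are outside the scope of this sketch.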

Videos

HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models · SlidesLive