Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Peng-Fei Zhang; Zi Huang

arXiv:2601.10313·cs.CV·February 18, 2026

TL;DR

This paper introduces the Hierarchical Refinement Attack (HRA), a universal multimodal attack framework for vision-language models that improves transferability and efficiency by hierarchically refining perturbations in both the image and text modalities.

Contribution

The paper presents a hierarchical approach to universal multimodal attacks on vision-language pre-training (VLP) models that reduces computational cost relative to sample-specific attacks and enhances transferability across tasks and datasets.

Findings

01

HRA achieves superior transferability compared to existing methods.

02

HRA effectively refines perturbations in both image and text modalities.

03

Extensive experiments across downstream tasks, VLP models, and datasets support the effectiveness of the proposed attack.

Abstract

Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. For the image modality, HRA refines the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA hierarchically models textual importance by considering both intra- and inter-sentence contributions to identify globally influential words, which are then used as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superior transferability of the proposed universal multimodal attacks.
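
The abstract does not give the update rule for the image modality, so the following is only a minimal sketch of one plausible reading: a single perturbation shared by all images is updated with a momentum term (historical gradients) combined with a gradient evaluated at a look-ahead point (a stand-in for the "estimated future" gradient). It is not the authors' algorithm. The toy matching loss, the function names `loss_grad` and `universal_perturbation`, and the parameters `eps`, `step`, and `mu` are all assumptions made for illustration.

```python
import numpy as np

def loss_grad(delta, X, T):
    """Toy image-text matching loss and its gradient w.r.t. a shared delta.

    X: (n, d) image features, T: (n, d) paired text features (both hypothetical).
    The loss is softplus(-score), so it grows as matching scores fall.
    """
    scores = np.sum((X + delta) * T, axis=1)
    loss = np.mean(np.logaddexp(0.0, -scores))
    weights = 1.0 / (1.0 + np.exp(scores))        # sigmoid(-score)
    grad = -(weights[:, None] * T).mean(axis=0)   # d loss / d delta
    return loss, grad

def universal_perturbation(X, T, eps=0.5, step=0.05, mu=0.9,
                           epochs=5, batch=16, seed=0):
    """Learn one L-infinity-bounded perturbation for the whole dataset."""
    rng = np.random.default_rng(seed)
    delta = np.zeros(X.shape[1])
    momentum = np.zeros_like(delta)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            # Look-ahead point: where the accumulated momentum would move delta.
            lookahead = np.clip(delta + step * mu * np.sign(momentum), -eps, eps)
            _, g_now = loss_grad(delta, X[idx], T[idx])
            _, g_future = loss_grad(lookahead, X[idx], T[idx])
            # Average current and look-ahead gradients, then L1-normalize.
            g = 0.5 * (g_now + g_future)
            g = g / (np.abs(g).sum() + 1e-12)
            # Historical gradients enter through the momentum buffer.
            momentum = mu * momentum + g
            # Signed ascent step on the loss, projected back into the eps-ball.
            delta = np.clip(delta + step * np.sign(momentum), -eps, eps)
    return delta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(64, 8))                  # toy image features
    T = np.abs(rng.normal(size=(64, 8))) + 0.1    # toy text features
    delta = universal_perturbation(X, T)
    print("clean loss:   ", loss_grad(np.zeros(8), X, T)[0])
    print("attacked loss:", loss_grad(delta, X, T)[0])
```

The same `delta` is reused for every input, which is what makes the perturbation universal rather than sample-specific; the per-batch cost is two gradient evaluations instead of an optimization run per sample.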

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Topic Modeling