VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

Hefei Mei; Zirui Wang; Shen You; Minjing Dong; Chang Xu

arXiv:2505.17440·cs.CV·February 5, 2026

TL;DR

VEAttack is a simple, downstream-agnostic adversarial attack targeting the vision encoder of LVLMs, significantly degrading performance across multiple tasks without requiring access to the full model or labels.

Contribution

This work introduces VEAttack, a novel vision encoder attack that reduces computational cost and task dependence, applicable across various LVLM tasks.

Findings

01. Achieves 94.5% performance degradation on image captioning
02. Achieves 75.7% performance degradation on visual question answering
03. Generalizes effectively to multiple downstream tasks

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. Existing effective attacks mostly focus on task-specific white-box settings. Such approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks, and they require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets only the vision encoder of LVLMs. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without access to the downstream large language model, task information, or labels. It significantly reduces…
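The objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tanh-linear `encode` function is a toy stand-in for a real vision encoder, the names `encode`, `cosine`, `cosine_grad_wrt_input`, and `encoder_attack` are hypothetical, and the step size, iteration count, and random start are illustrative assumptions (the default `eps` of 8/255 sits in the 2–8/255 range mentioned in the reviews). The sketch runs a projected sign-gradient descent that lowers the cosine similarity between clean and perturbed features while keeping the perturbation inside an L∞ ball, using only the encoder.

```python
import math
import random


def encode(weights, x):
    """Toy stand-in for a vision encoder: features = tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_grad_wrt_input(weights, x_adv, clean_feat):
    """Analytic gradient of cosine(encode(x_adv), clean_feat) w.r.t. x_adv."""
    u = encode(weights, x_adv)
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in clean_feat))
    cos = cosine(u, clean_feat)
    # d cos / d u_j = v_j / (|u||v|) - cos * u_j / |u|^2
    d_u = [v / (norm_u * norm_v) - cos * a / norm_u ** 2
           for a, v in zip(u, clean_feat)]
    # Chain rule through tanh: d tanh(z) / dz = 1 - tanh(z)^2
    d_pre = [g * (1.0 - a * a) for g, a in zip(d_u, u)]
    # Pre-activations are W x, so the input gradient is W^T d_pre
    return [sum(weights[j][i] * d_pre[j] for j in range(len(weights)))
            for i in range(len(x_adv))]


def encoder_attack(weights, x_clean, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Minimize feature cosine similarity inside an L-infinity ball of radius eps."""
    rng = random.Random(seed)
    clean_feat = encode(weights, x_clean)
    # Random start (an assumption of this sketch): at x_adv == x_clean the
    # cosine similarity is at its maximum, so its gradient is zero there.
    x_adv = [min(1.0, max(0.0, x + rng.uniform(-eps, eps))) for x in x_clean]
    for _ in range(iters):
        grad = cosine_grad_wrt_input(weights, x_adv, clean_feat)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = x_adv[i] - step * sign  # descend on cosine similarity
            # Project back into the eps-ball around the clean input ...
            moved = min(x_clean[i] + eps, max(x_clean[i] - eps, moved))
            # ... and into the valid pixel range [0, 1].
            x_adv[i] = min(1.0, max(0.0, moved))
    return x_adv
```

Note that the loop touches only the encoder: no language model, task prompt, or label appears anywhere in the objective, which is the downstream-agnostic property the abstract emphasizes. In practice the gradient would come from automatic differentiation through a real encoder rather than the hand-derived formula used for this toy model.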

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Innovative Attack Setting: focus on the vision encoder in a gray-box setting.

Weaknesses

1. Lack of citations and comparisons with papers highly similar to this paper:
   - An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models, ICLR 2024.
   - QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models, NAACL 2025.
   - InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models, arXiv.
2. Without Any New Insights: Attacking the vision encoder to achieve attacks on the entire LVLMs is not novel; it is quite intuitive.
3

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. Realistic threat model: Attacks only the shared vision encoder, a genuinely deployable setting for LVLM vulnerabilities.
2. Theoretically principled: Clear justification that perturbing patch tokens yields stronger downstream disruption than class tokens.
3. Highly transferable: Single perturbation damages multiple tasks (captioning, VQA, hallucination).
4. Efficiency: 8–13× faster than prior multi-step attacks, with small ε (2–8/255).
5. Insightful analysis: Reveals internal LLM distortions,

Weaknesses

1. Defense gap: No practical mitigation or robust-training strategy is explored beyond noting cost trade-offs.
2. Limited architecture diversity: Focuses mainly on CLIP-based encoders; broader evaluation would strengthen claims.
3. Transfer paradox underexplained: The Möbius effect is intriguing but remains a descriptive observation, not a mechanistic analysis.
4. Ethical discussion minimal: Needs clearer guidance on responsible release and safety implications.
5. The paper closely overlaps with

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The motivation is clear, and the introduction effectively conveys the idea. VEAttack provides a solid and effective paradigm for gray-box adversarial attacks on LVLMs, offering a detailed analysis and feasibility assessment for this approach. The effectiveness and efficiency are well demonstrated across several datasets and models.

Weaknesses

(1) Table 9 shows the attack performance of the Image-Text Retrieval task, which complements the tasks. However, another focus of these works [1, 2] is on transfer attacks between vision encoders, like ALBEF and CLIP-CNN, and it is recommended to include more demonstrations of this performance.
(2) Eq. (5) gives two baselines, but seems to lack the comparison of the second L2 Attack.
(3) Based on observation 4, you perform a time comparison. However, I notice that the used step is 50 instead o

Code & Models

Repositories

hfmei/veattack-lvlm
PyTorch · Official

Taxonomy

Methods: Softmax · Attention Is All You Need · Focus