TL;DR
This paper introduces a dual-axis evaluation framework for universal adversarial attacks on vision-language models. It separates Influence (the model's output was perturbed) from Precise Injection (the attacker's chosen target concept was actually emitted) and reveals a significant gap between the two.
Contribution
It proposes a dual-axis assessment method for adversarial attacks that pairs influence detection with a graded measure of injection accuracy, and provides a dataset and analysis of attack effectiveness.
Findings
Most pairs show influence without successful injection.
Zero detectable drift in BLIP-2 at specified perturbation levels.
Significant divergence between influence and injection success rates.
Abstract
Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model's output was perturbed (Influence), and (ii) the attacker's chosen target concept was actually emitted (Precise Injection). We compose two existing techniques -- Universal Adversarial Attack and AnyAttack -- under an L∞ budget of 16/255, and we add a dual-axis evaluation: a deterministic Ratcliff-Obershelp drift score for Influence (programmatic baseline) plus a 4-tier ordinal category (none/weak/partial/confirmed) for Precise Injection. The judge is DeepSeek-V4-Pro in thinking mode, calibrated against Claude Opus 4.7 with Cohen's κ = 0.77 on the…
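The two measurements named in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the drift score is taken here as 1 minus the similarity ratio from the standard library's `difflib.SequenceMatcher` (a close relative of Ratcliff-Obershelp gestalt matching), and judge agreement is computed as unweighted Cohen's kappa. The function names `drift_score` and `cohen_kappa` are hypothetical, and the paper's exact definitions (for example, a weighted kappa for ordinal tiers) may differ.

```python
from difflib import SequenceMatcher


def drift_score(clean_output: str, attacked_output: str) -> float:
    """Assumed Influence measure: 0.0 for identical outputs, 1.0 for no overlap."""
    # autojunk=False disables difflib's popularity heuristic so that long
    # captions are compared the same way as short ones.
    matcher = SequenceMatcher(None, clean_output, attacked_output, autojunk=False)
    return 1.0 - matcher.ratio()


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    clean = "a dog sitting on a bench"
    attacked = "a dog sitting on a park bench at night"
    print(f"drift = {drift_score(clean, attacked):.3f}")

    # Tiers follow the abstract's none/weak/partial/confirmed scale.
    judge = ["none", "weak", "none", "confirmed"]
    reference = ["none", "weak", "partial", "confirmed"]
    print(f"kappa = {cohen_kappa(judge, reference):.3f}")
```

Under these assumptions, a pair can show nonzero drift (Influence) while still being rated `none` on the injection scale, which is the distinction the abstract draws.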
