A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, and JianFeng Ma

TL;DR
This paper introduces CrossMPI, a prompt injection attack that perturbs only the image yet steers how a large vision-language model (LVLM) interprets both its textual and visual inputs.
Contribution
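The authors propose a cross-modal prompt injection attack that shifts perturbation optimization from the visual embedding space to the model's hidden states, paired with new optimization strategies, showing that an image-only input can reach the text side of the multimodal attack surface.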
Findings
CrossMPI outperforms baseline methods across multiple LVLMs and datasets.
The most effective layers to attack lie in the middle of the model rather than at its final layer.
The attack effectively steers model interpretation of both text and images.
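To make the hidden-state idea concrete, here is a minimal toy sketch in Python. It is not CrossMPI and not the authors' code: a small random tanh stack stands in for an LVLM, and a generic projected sign-gradient loop pulls the middle-layer activation of a perturbed input toward a chosen target activation under an L-infinity budget. All names (`forward`, `loss_and_grad`, `perturb`, `MID`, `eps`, `step`) and all numeric settings are assumptions made for illustration only.

```python
import math
import random

def forward(weights, x, layer):
    """Return activations [h_0, ..., h_layer] of a toy tanh stack (h_0 = x)."""
    acts = [list(x)]
    for W in weights[:layer]:
        prev = acts[-1]
        acts.append([math.tanh(sum(w * p for w, p in zip(row, prev))) for row in W])
    return acts

def loss_and_grad(weights, x, target, layer):
    """Squared distance between the hidden state at `layer` and `target`,
    plus its gradient with respect to the input x (manual backprop)."""
    acts = forward(weights, x, layer)
    h = acts[-1]
    loss = sum((a - t) ** 2 for a, t in zip(h, target))
    g = [2.0 * (a - t) for a, t in zip(h, target)]
    for W, out in zip(reversed(weights[:layer]), reversed(acts[1:])):
        # Backprop through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
        g = [gi * (1.0 - o * o) for gi, o in zip(g, out)]
        # Backprop through the linear map: multiply by W transposed.
        g = [sum(W[r][c] * g[r] for r in range(len(W))) for c in range(len(W[0]))]
    return loss, g

def perturb(weights, image, target, layer, eps=0.1, step=0.01, iters=200):
    """Projected sign-gradient descent on an L_inf-bounded perturbation.
    Returns the best perturbation seen and its loss."""
    delta = [0.0] * len(image)
    best_delta, best_loss = list(delta), None
    for _ in range(iters):
        x = [p + d for p, d in zip(image, delta)]
        loss, grad = loss_and_grad(weights, x, target, layer)
        if best_loss is None or loss < best_loss:
            best_loss, best_delta = loss, list(delta)
        # Step against the gradient sign, then clip back into the eps-ball.
        delta = [max(-eps, min(eps, d - step * ((g > 0) - (g < 0))))
                 for d, g in zip(delta, grad)]
    return best_delta, best_loss

random.seed(0)
DIM, DEPTH = 8, 4                      # hypothetical toy sizes
weights = [[[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(DEPTH)]
image = [random.random() for _ in range(DIM)]   # stand-in for a clean image
other = [random.random() for _ in range(DIM)]   # stand-in for the injected content
MID = DEPTH // 2                                # attack a middle layer
target = forward(weights, other, MID)[-1]

clean_loss, _ = loss_and_grad(weights, image, target, MID)
delta, adv_loss = perturb(weights, image, target, MID)
print(f"clean loss {clean_loss:.4f} -> perturbed loss {adv_loss:.4f}")
```

Changing `MID` to `DEPTH` targets the final layer instead, which is one way to explore the layer-choice finding on the toy model. The sketch does not clip the perturbed input to a valid pixel range and says nothing about how a real LVLM behaves.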
Abstract
Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that single input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack, CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only parameters) to the model hidden…
