IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

TL;DR
This paper introduces IAG, an input-aware backdoor attack on vision-language models (VLMs) for visual grounding. The attack is reported to be both effective and stealthy while leaving performance on benign inputs intact, which exposes a security vulnerability in VLM-based grounding systems.
Contribution
We propose IAG, the first dynamic, input-aware backdoor attack on VLM-based visual grounding that uses text-guided triggers conditioned on target descriptions.
Findings
IAG achieves high attack success rates across multiple models and datasets.
IAG maintains normal grounding accuracy on benign inputs.
The attack is robust against existing defenses and transferable across models.
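The findings above refer to attack success rate and benign grounding accuracy without stating how either is scored. Grounding benchmarks commonly count a prediction as correct when its intersection-over-union (IoU) with a reference box reaches 0.5. The following is a minimal sketch under that assumption; the `(x1, y1, x2, y2)` box convention, the 0.5 threshold, and the names `iou` and `hit_rate` are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box convention (not from the paper): (x1, y1, x2, y2) with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_rate(preds: Sequence[Box], refs: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired reference box meets the threshold."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(preds, refs))
    return hits / len(preds)
```

Under this scoring, the same function covers both quantities: passing predictions on triggered inputs together with the attacker-specified target boxes gives an attack success rate, while passing predictions on benign inputs together with the ground-truth boxes gives clean grounding accuracy.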
Abstract
Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language…
