IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li; Beining Xu; Simin Chen; Jiatong Li; Jingdi Lei; Haodong Zhao; Di Zhang

arXiv:2508.09456 · cs.CV · March 24, 2026

TL;DR

This paper introduces IAG, an input-aware backdoor attack on vision-language models (VLMs) used for visual grounding. The attack is reported to be effective and stealthy without degrading performance on benign inputs, which exposes a security vulnerability in VLM-based grounding systems.

Contribution

We propose IAG, the first dynamic, input-aware backdoor attack on VLM-based visual grounding that uses text-guided triggers conditioned on target descriptions.

Findings

01 IAG achieves high attack success rates across multiple models and datasets.

02 IAG maintains normal grounding accuracy on benign inputs.

03 The attack is robust against existing defenses and transferable across models.

Abstract

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language…
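
To make the idea of an input-aware, text-guided trigger more concrete, here is a minimal sketch of how a perturbation could depend on both the image and a target description while staying small. It is not the paper's implementation: `embed_text` and `toy_generator` are hypothetical stand-ins for a text encoder and the text-conditioned UNet, and the per-pixel budget `EPSILON` is an assumption, chosen as one common way to keep a perturbation hard to see.

```python
import hashlib
import math

EPSILON = 8 / 255  # assumed per-pixel budget; not a value taken from the paper


def embed_text(description, dim=4):
    """Toy stand-in for a text encoder: hash the description into floats in [-1, 1]."""
    digest = hashlib.sha256(description.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def toy_generator(image, text_vec):
    """Toy stand-in for a text-conditioned UNet.

    Produces one raw value per pixel that depends on both the pixel
    intensity (input-aware) and the text embedding (text-guided).
    """
    out = []
    for r, row in enumerate(image):
        out_row = []
        for c, pixel in enumerate(row):
            weight = text_vec[(r + c) % len(text_vec)]
            out_row.append(weight * (pixel + 0.5))
        out.append(out_row)
    return out


def add_trigger(image, description, epsilon=EPSILON):
    """Return the image plus a perturbation bounded by epsilon per pixel."""
    text_vec = embed_text(description)
    raw = toy_generator(image, text_vec)
    poisoned = []
    for img_row, raw_row in zip(image, raw):
        poisoned.append([
            # tanh squashes the raw output to (-1, 1); the result is kept in [0, 1]
            min(1.0, max(0.0, p + epsilon * math.tanh(v)))
            for p, v in zip(img_row, raw_row)
        ])
    return poisoned


# A 2x2 grayscale "image" with intensities in [0, 1]
clean = [[0.2, 0.5], [0.7, 0.9]]
poisoned = add_trigger(clean, "the red cup on the left")
max_change = max(
    abs(a - b) for ra, rb in zip(clean, poisoned) for a, b in zip(ra, rb)
)
print(f"largest per-pixel change: {max_change:.4f} (budget {EPSILON:.4f})")
```

In this sketch a different target description or a different input image yields a different trigger pattern, which is the property that separates an input-aware attack from one that stamps a fixed, static patch onto every image.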
