BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang

TL;DR
This paper introduces BEAT, a novel framework for injecting visual backdoors into VLM-based embodied agents using object triggers, revealing significant security vulnerabilities and proposing a contrastive learning approach to enhance attack success.
Contribution
BEAT is the first method to reliably implant visual backdoors in VLM-based embodied agents with object triggers, employing a two-stage training scheme including contrastive trigger learning.
Findings
Achieves up to 80% attack success rate across benchmarks.
Boosts backdoor activation accuracy by 39% with limited data.
Maintains strong performance on benign tasks.
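The three findings above correspond to three rates that can be computed from logged evaluation episodes. The following is a minimal sketch, not the authors' evaluation code: the `Episode` record, its field names, and the `summarize` helper are hypothetical, and it assumes each episode is labeled only by whether the trigger was present, whether the attacker-specified policy was executed, and whether the benign task succeeded.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode log record (field names are assumptions).
    trigger_present: bool    # was the trigger object visible in the scene?
    backdoor_executed: bool  # did the agent run the attacker-specified policy?
    task_succeeded: bool     # did the agent complete the benign user task?


def rate(flags):
    """Fraction of True values; NaN when there are no episodes to average."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(episodes):
    """Split episodes by trigger presence and compute the three rates."""
    triggered = [e for e in episodes if e.trigger_present]
    clean = [e for e in episodes if not e.trigger_present]
    return {
        # Backdoor fires when the trigger is in the scene.
        "attack_success_rate": rate(e.backdoor_executed for e in triggered),
        # Backdoor fires although no trigger is present (false positive).
        "false_trigger_rate": rate(e.backdoor_executed for e in clean),
        # Ordinary task performance on trigger-free episodes.
        "benign_success_rate": rate(e.task_succeeded for e in clean),
    }


# Toy data: 5 triggered episodes (4 activate), 5 clean episodes (3 succeed).
episodes = (
    [Episode(True, True, False)] * 4
    + [Episode(True, False, True)]
    + [Episode(False, False, True)] * 3
    + [Episode(False, False, False)] * 2
)
stats = summarize(episodes)
```

Reporting the false trigger rate next to the attack success rate separates "the backdoor fires when it should" from "the backdoor stays silent when it should", which is the distinction the reviews raise when discussing false positives.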
Abstract
Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision-driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is well motivated and presents a novel backdoor attack targeting embodied agents within MLLM frameworks. The focus on multi-turn behavior further expands upon prior studies.
- The paper’s writing and organization are very clear.
- The validation is comprehensive across models and benchmarks, showing the generalizability of this approach.
The attack is limited to a relatively simple setup, i.e., a single static trigger per benchmark, without complex actions. Injecting triggers into the dataset is also relatively expensive and not practical in real-world robotic pipelines. The authors should also clarify the questions below: Experiments - ASR is measured based on the final output; however, given the CTL objective, it would be more aligned to see the evaluation around the trigger appearing time (exact frame, or wi
1. Novel problem focus: this is the first systematic study of object-triggered, multi-step backdoor behavior in MLLM-embodied agents, going beyond prior single-turn or textual triggers.
2. Contrastive trigger learning is intuitive and effective for sharpened trigger activation and reduced false positives.
3. Experiments across multiple simulators and models demonstrate strong ASR, preserved benign performance, and OOD generalization.
1. While the backdoor setting under multi-agent is novel, the methodology is simple. The object trigger is similar to Shadowcast (Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models), and the two-stage training is also standard. CTL is intuitive but adapted from a similar problem.
2. The backdoor trigger remains a visible object; the evaluation lacks physical-world noise defenses (e.g., blur, smoothing), which would align more with realistic robotic/security settings.
- **A forward-looking and promising approach**: The threat model and method in this paper are highly novel. As MLLMs are increasingly deployed in robotics and autonomous agents, this type of visual backdoor attack is highly relevant to potential real-world security challenges. This work opens up an important and timely research direction for the field of embodied AI security.
- The paper's methodology is easy to follow, and the experimental design is rigorous with a comprehensive evaluation suit
- **Trigger Complexity and Semantic Level**: Currently, the attack is limited to single, predefined objects like a "knife" or a "vase." I think the authors could consider exploring more complex trigger conditions to improve the attack's stealth. For example, the trigger could be a combination of objects ("a knife and a blue cube on the table") or even more abstract scene semantics. While the proposed framework has the potential to be extended to such scenarios, the current experiments do not cov
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
