TRAP: Targeted Redirecting of Agentic Preferences
Hangoo Kang, Jehyeok Yeon, Gagandeep Singh

TL;DR
TRAP introduces a diffusion-based semantic injection framework that subtly manipulates vision-language models, exposing vulnerabilities in autonomous agent decision-making without requiring internal model access.
Contribution
It presents a novel, model-agnostic adversarial method that exploits semantic vulnerabilities in cross-modal AI systems using naturalistic images.
Findings
TRAP effectively redirects agent preferences across multiple models.
It outperforms existing adversarial baselines in inducing decision biases.
The method reveals a significant vulnerability in current vision-language AI systems.
Abstract
Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploit semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a novel generative adversarial framework that manipulates the agent's decision-making using diffusion-based semantic injections into the vision-language embedding space. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural…
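The combination of negative-prompt degradation and positive semantic optimization under a spatial mask can be illustrated with a toy sketch. This is not the authors' implementation: TRAP operates on a diffusion model with a Siamese semantic network, whereas the code below uses plain vectors as stand-ins for embeddings. All names (`cosine_grad`, `masked_semantic_step`, `optimize`, the weight `lam`, the step size `lr`) are hypothetical and chosen only for illustration. The sketch assumes the objective is cosine similarity to a positive target minus a weighted similarity to a negative target, and that the mask simply zeroes updates outside the editable region.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz = math.sqrt(sum(x * x for x in z))
    nt = math.sqrt(sum(y * y for y in t))
    c = cosine(z, t)
    return [ti / (nz * nt) - c * zi / (nz * nz) for zi, ti in zip(z, t)]


def objective(z, pos, neg, lam=0.5):
    """Assumed toy objective: attract to `pos`, repel from `neg`."""
    return cosine(z, pos) - lam * cosine(z, neg)


def masked_semantic_step(z, pos, neg, mask, lr=0.1, lam=0.5):
    """One gradient-ascent step, applied only where mask == 1."""
    gp = cosine_grad(z, pos)
    gn = cosine_grad(z, neg)
    return [zi + lr * m * (p - lam * n)
            for zi, m, p, n in zip(z, mask, gp, gn)]


def optimize(z, pos, neg, mask, steps=50, lr=0.1, lam=0.5):
    """Repeat the masked step a fixed number of times."""
    for _ in range(steps):
        z = masked_semantic_step(z, pos, neg, mask, lr, lam)
    return z


if __name__ == "__main__":
    z0 = [1.0, 0.0, 0.5, 0.2]      # stand-in for an image embedding
    pos = [0.0, 1.0, 0.0, 1.0]     # stand-in for the positive prompt
    neg = [1.0, 0.0, 0.0, 0.0]     # stand-in for the negative prompt
    mask = [0, 1, 0, 1]            # only coordinates 1 and 3 are editable
    z1 = optimize(z0, pos, neg, mask)
    print(objective(z0, pos, neg), objective(z1, pos, neg))
```

In this toy setting the mask plays the role of the layout-aware constraint: coordinates outside the mask are never modified, so the edit stays confined to the chosen region while the objective still improves.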
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI
Methods: Diffusion
