TL;DR
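This paper introduces DeepTrap, an automated framework that discovers contextual vulnerabilities in agentic language-model systems by manipulating their execution contexts (files, memory, tools, skills, and auxiliary artifacts), exposing security risks that go beyond explicit user prompts.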
Contribution
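It formulates adversarial context manipulation as a black-box, trajectory-level optimization problem for finding context-based vulnerabilities in AI agents, supported by a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and by an evaluation of nine target models.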
Findings
Contextual compromise can cause unsafe behaviors without affecting task performance.
Final-response evaluation alone is insufficient for security assessment.
DeepTrap effectively uncovers high-risk vulnerabilities across multiple scenarios.
Abstract
Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce…
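As a rough illustration of the optimization described in the abstract, the sketch below combines a multi-objective trajectory score with a reward-guided beam search over candidate context edits. It is a minimal sketch under stated assumptions, not DeepTrap's implementation: the weighted-sum reward, the weights, the `Scores` fields, and the `toy_mutate` and `toy_evaluate` helpers are all hypothetical, and the toy evaluator stands in for running a real agent and grading its trajectory. Risk-conditioned evaluation and reflection-based deep probing are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# A "context" is modeled as an ordered tuple of abstract edit labels (assumption).
Context = Tuple[str, ...]


@dataclass(frozen=True)
class Scores:
    """Hypothetical per-trajectory grades, each assumed to lie in [0, 1]."""
    risk: float     # how fully the targeted unsafe behavior was realized
    utility: float  # how well the benign task was still completed
    stealth: float  # how inconspicuous the context change is


def trajectory_reward(s: Scores, w_risk: float = 1.0,
                      w_utility: float = 1.0, w_stealth: float = 0.5) -> float:
    """Assumed weighted-sum scalarization of the three objectives."""
    return w_risk * s.risk + w_utility * s.utility + w_stealth * s.stealth


def beam_search(initial: Context,
                mutate: Callable[[Context], Iterable[Context]],
                evaluate: Callable[[Context], Scores],
                beam_width: int = 2,
                depth: int = 3) -> Tuple[float, Context]:
    """Black-box search: only evaluate() is queried, never model internals."""
    beam: List[Tuple[float, Context]] = [
        (trajectory_reward(evaluate(initial)), initial)
    ]
    best = beam[0]
    for _ in range(depth):
        candidates = [
            (trajectory_reward(evaluate(child)), child)
            for _, ctx in beam
            for child in mutate(ctx)
        ]
        if not candidates:
            break
        # Keep only the highest-reward candidates for the next round.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


# --- Toy stand-ins (hypothetical; a real setup would run and grade an agent) ---
EDITS = ("memory_note", "extra_file", "tool_description")


def toy_mutate(ctx: Context) -> List[Context]:
    """Propose every context that adds one not-yet-used edit."""
    return [ctx + (e,) for e in EDITS if e not in ctx]


def toy_evaluate(ctx: Context) -> Scores:
    """Made-up grading rule, chosen only to exercise the search."""
    risk = 1.0 if "memory_note" in ctx else 0.0
    utility = 0.5 if "extra_file" in ctx else 1.0
    stealth = max(0.0, 1.0 - 0.2 * len(ctx))
    return Scores(risk=risk, utility=utility, stealth=stealth)


if __name__ == "__main__":
    reward, ctx = beam_search((), toy_mutate, toy_evaluate, beam_width=2, depth=2)
    print(reward, ctx)
```

In this toy setup the search prefers the single edit that raises risk without lowering utility, because each extra edit costs stealth. The scalarization is the main design choice: a weighted sum is the simplest way to trade off the three objectives, but the paper may combine them differently.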
