Why Does Agentic Safety Fail to Generalize Across Tasks?
Yonatan Slutzky, Yotam Alexander, Tomer Slor, Yoav Nagel, Nadav Cohen

TL;DR
This paper investigates why safety in AI agents fails to generalize to unseen tasks, arguing that safety requirements inherently make the relationship between a task and its execution more complex, and suggests new directions for ensuring safety.
Contribution
The paper provides theoretical analysis and empirical evidence that the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone, highlighting limitations of current safety approaches.
Findings
The mapping from a task to its execution has a higher Lipschitz constant when safety constraints are imposed
Experiments in quadcopter and CRM settings demonstrate that safety fails to generalize to unseen tasks
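To make "higher Lipschitz constant" concrete: a mapping f from tasks to executions is L-Lipschitz if |f(a) - f(b)| <= L |a - b| for all tasks a, b, so a larger constant means small changes in the task can demand large changes in behavior. The sketch below estimates this constant from sampled one-dimensional task parameters by taking the largest pairwise slope. It is a generic illustration, not the paper's method; the function name `empirical_lipschitz` and the two toy mappings are hypothetical stand-ins chosen only to show how the estimate behaves.

```python
import itertools
from typing import Callable, Sequence


def empirical_lipschitz(f: Callable[[float], float], tasks: Sequence[float]) -> float:
    """Estimate the Lipschitz constant of f from sampled (hypothetical) task parameters.

    Returns the largest ratio |f(a) - f(b)| / |a - b| over all distinct pairs.
    This is a lower bound: the true constant over the full task space is >= it.
    """
    best = 0.0
    for a, b in itertools.combinations(tasks, 2):
        if a == b:  # skip duplicate samples to avoid division by zero
            continue
        best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best


if __name__ == "__main__":
    tasks = [i / 10 for i in range(11)]       # hypothetical 1-D tasks in [0, 1]
    gentle = lambda t: 2.0 * t                # hypothetical smooth task-to-execution map
    steep = lambda t: min(1.0, 10.0 * t)      # hypothetical map with a sharp transition
    print(empirical_lipschitz(gentle, tasks))  # about 2
    print(empirical_lipschitz(steep, tasks))   # about 10
```

Sampling only gives a lower bound, so in practice the sample should be dense near regions where the mapping is expected to change quickly.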
Abstract
AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with -robustness,…
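The abstract's theoretical setting is linear-quadratic control. As a minimal, hypothetical illustration of that setting only (not the authors' model, and without the robustness term the abstract mentions), the sketch below solves a scalar discrete-time LQR problem by iterating the Riccati equation, treating the state-cost weight q as an assumed "task" parameter so one can see how the optimal controller moves as the task changes. The function name `scalar_lqr` and its parameters are illustrative assumptions.

```python
def scalar_lqr(a: float, b: float, q: float, r: float,
               tol: float = 1e-12, max_iter: int = 10_000) -> tuple[float, float]:
    """Scalar discrete-time LQR via fixed-point iteration of the Riccati equation.

    System: x[t+1] = a*x[t] + b*u[t]; cost: sum over t of q*x^2 + r*u^2.
    Assumes b != 0 and q, r > 0 so the iteration settles on the stabilizing solution.
    Returns (P, K), where P is the Riccati solution and the optimal policy is u = -K*x.
    """
    p = q
    for _ in range(max_iter):
        # P = q + a^2 P - (a b P)^2 / (r + b^2 P)
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        converged = abs(p_next - p) < tol
        p = p_next
        if converged:
            break
    k = a * b * p / (r + b * b * p)
    return p, k


if __name__ == "__main__":
    # Sweep the hypothetical "task" parameter q and watch the optimal gain K change.
    for q in (0.5, 1.0, 2.0, 4.0):
        p, k = scalar_lqr(a=1.0, b=1.0, q=q, r=1.0)
        print(f"q={q}: P={p:.4f}, K={k:.4f}")
```

For a = b = q = r = 1 the Riccati equation reduces to P^2 = 1 + P, so P is the golden ratio and K = P / (1 + P) is about 0.618, which makes a quick sanity check for the iteration.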
