TL;DR
This paper advocates for symbolic guardrails as a practical method to enhance safety and security guarantees in domain-specific AI agents without compromising their utility.
Contribution
It provides a systematic review of 80 agent safety and security benchmarks, analyzes which policy requirements symbolic guardrails can enforce, and evaluates how those guardrails affect agent safety, security, and task success.
Findings
85% of benchmarks lack concrete policies
74% of policy requirements can be enforced by symbolic guardrails
Symbolic guardrails improve safety and security without utility loss
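To make the notion of enforcement concrete, the following is a minimal sketch of what a symbolic guardrail could look like: a deterministic check that runs before every tool call and blocks the call when a rule is violated, regardless of what the model proposes. It is an illustration only, not the paper's implementation; the tool name `issue_refund`, the refund cap, and the ownership rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class PolicyViolation(Exception):
    """Raised when a proposed tool call would break a hard-coded rule."""


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: Dict[str, Any]


# Hypothetical policy values, chosen only for illustration.
MAX_REFUND = 100.0


def check_refund(call: ToolCall, session_user: str) -> None:
    """Deterministic rule: cap the amount and require order ownership."""
    if call.args["amount"] > MAX_REFUND:
        raise PolicyViolation("refund exceeds cap")
    if call.args["order_owner"] != session_user:
        raise PolicyViolation("order belongs to another user")


# Map each guarded tool name to its rule; unguarded tools pass through.
GUARDS: Dict[str, Callable[[ToolCall, str], None]] = {
    "issue_refund": check_refund,
}


def guarded_execute(
    call: ToolCall,
    session_user: str,
    tools: Dict[str, Callable[..., Any]],
) -> Any:
    """Run the rule for this tool (if any) before executing the tool."""
    guard = GUARDS.get(call.name)
    if guard is not None:
        guard(call, session_user)  # raises PolicyViolation on failure
    return tools[call.name](**call.args)
```

Because the check is ordinary code rather than a learned classifier, a violating call cannot reach the tool, which is the kind of guarantee that training-based methods and neural guardrails cannot offer.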
Abstract
AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying…
