TL;DR
This paper introduces a safety framework for AI agents in which the agent expresses its actions as code in a capability-safe programming language. The type system then gives static guarantees about what the agent can do, and agents can generate such code without significant performance loss.
Contribution
The paper proposes a safety harness that uses Scala 3's type system, with capture checking, to enforce capability-based access control for AI agents.
Findings
Agents can generate capability-safe code without significant performance loss.
The type system effectively prevents information leakage and malicious side effects.
Extensible safety harnesses can be built using a strong type system with tracked capabilities.
Abstract
AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be…
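The idea of capabilities as ordinary program variables can be illustrated with a minimal sketch in plain Scala 3. This is not the paper's harness: `Network`, `publish`, and `summarize` are hypothetical names invented here, and the sketch deliberately leaves out the experimental capture checker. As a result, the purity of `summarize` holds only by convention in this sketch, whereas the abstract describes Scala's type system tracking capabilities statically.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical capability: in this sketch, holding a Network value
// is the only way to send anything out of the program.
final class Network(val sent: ListBuffer[String] = ListBuffer.empty):
  def send(msg: String): Unit = sent += msg

// Needs the capability, so the caller must supply one via `using`.
def publish(report: String)(using net: Network): Unit =
  net.send(report)

// Receives no capability and has no given Network in scope.
// Assumption: purity here is by convention only; this sketch does not
// enable capture checking, so the compiler does not verify it.
def summarize(secret: String): String =
  s"${secret.length} characters"

@main def demo(): Unit =
  given net: Network = Network()
  // The classified text is handled only by the capability-free function.
  val summary = summarize("classified text")
  // Only the derived summary reaches code that holds the capability.
  publish(summary)
  println(net.sent.toList)
```

In this sketch the classified string is only ever passed to `summarize`, which is handed no `Network`, while `publish` sees only the derived summary. Nothing here stops a function from constructing its own `Network`, which is one reason the sketch is weaker than a harness built on statically tracked capabilities.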
