Toward a Principled Framework for Agent Safety Measurement
Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

TL;DR
This paper proposes a search-based framework called BOA for more comprehensive and reliable safety evaluation of LLM agents, capturing rare unsafe behaviors missed by traditional sampling methods.
Contribution
The paper introduces BOA, a search-based safety measurement framework that explores an agent's trajectory space within a likelihood budget, which improves detection of unsafe behaviors.
Findings
BOA discovers unsafe trajectories missed by greedy and sampled evaluations.
BOA enables ranking models, defenses, and attacks on the same scale.
BOA is practical with batched decoding, prefix caching, and chunked tree expansion.
Abstract
LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety…
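To make the idea of searching within a likelihood budget concrete, here is a minimal sketch in Python. It is not the paper's implementation: the toy policy, the `is_unsafe` judge, the fixed trajectory depth, and the use of a minimum trajectory log-probability as the "budget" are all illustrative assumptions. The sketch enumerates every trajectory whose probability stays above the threshold, judges each complete one, and reports how much probability mass is safe, unsafe, or left unexplored.

```python
import math
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, ...]
# Hypothetical interface: maps an action prefix to next-action probabilities.
Policy = Callable[[Trajectory], Dict[str, float]]


def toy_policy(prefix: Trajectory) -> Dict[str, float]:
    """Illustrative agent: usually takes a benign action, rarely a risky one."""
    return {"ok": 0.95, "rm": 0.05}


def is_unsafe(trajectory: Trajectory) -> bool:
    """Illustrative judge: any risky action makes the whole trajectory unsafe."""
    return "rm" in trajectory


def search_in_budget(
    policy: Policy,
    judge: Callable[[Trajectory], bool],
    depth: int,
    min_prob: float,
) -> Tuple[float, float, float, List[Trajectory]]:
    """Depth-first enumeration of all trajectories with probability >= min_prob.

    Returns (safe_mass, unsafe_mass, unexplored_mass, unsafe_trajectories).
    """
    min_logprob = math.log(min_prob)
    safe_mass = 0.0
    unsafe_mass = 0.0
    unsafe_found: List[Trajectory] = []
    stack: List[Tuple[Trajectory, float]] = [((), 0.0)]

    while stack:
        prefix, logp = stack.pop()
        if len(prefix) == depth:
            # Complete trajectory: judge it and credit its probability mass.
            if judge(prefix):
                unsafe_mass += math.exp(logp)
                unsafe_found.append(prefix)
            else:
                safe_mass += math.exp(logp)
            continue
        for action, prob in policy(prefix).items():
            if prob <= 0.0:
                continue
            child_logp = logp + math.log(prob)
            # Prune branches that fall below the likelihood budget.
            if child_logp >= min_logprob:
                stack.append((prefix + (action,), child_logp))

    unexplored_mass = 1.0 - safe_mass - unsafe_mass
    return safe_mass, unsafe_mass, unexplored_mass, unsafe_found


def greedy_rollout(policy: Policy, depth: int) -> Trajectory:
    """Baseline: always take the most likely action."""
    prefix: Trajectory = ()
    for _ in range(depth):
        dist = policy(prefix)
        prefix += (max(dist, key=dist.get),)
    return prefix


if __name__ == "__main__":
    safe, unsafe, rest, found = search_in_budget(toy_policy, is_unsafe, 3, 1e-3)
    print(f"safe={safe:.6f} unsafe={unsafe:.6f} unexplored={rest:.6f}")
    print("greedy is unsafe:", is_unsafe(greedy_rollout(toy_policy, 3)))
```

In this toy setting the greedy rollout never takes the risky action and so looks fully safe, while the budgeted search finds six unsafe trajectories carrying about 0.1425 of the probability mass. Reporting the unexplored mass alongside the score shows how much of the distribution the budget left unexamined.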
