Toward a Principled Framework for Agent Safety Measurement

Shuyi Lin; Anshuman Suri; Alina Oprea; Cheng Tan

arXiv:2605.01644·cs.CR·May 5, 2026

TL;DR

This paper proposes BOA, a search-based framework for measuring the safety of LLM agents. By searching the trajectory space instead of sampling it, BOA surfaces rare unsafe behaviors that greedy and sampled evaluations miss.

Contribution

The paper introduces BOA, a search-based safety measurement framework that explores the agent's trajectory space within a likelihood budget, improving detection of unsafe behaviors.

Findings

01. BOA discovers unsafe trajectories missed by greedy and sampled evaluations.

02. BOA enables ranking models, defenses, and attacks on the same scale.

03. BOA is practical with batched decoding, prefix caching, and chunked tree expansion.
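A short calculation illustrates why a handful of sampled rollouts can miss a rare unsafe trajectory. The sketch below is not from the paper: it assumes rollouts are independent and that a single unsafe trajectory has a fixed probability p_unsafe under the decoder. The function name and the example value 0.01 are hypothetical.

```python
def miss_probability(p_unsafe: float, n_rollouts: int) -> float:
    """Probability that n independent sampled rollouts all avoid a
    trajectory whose probability is p_unsafe (illustrative assumption)."""
    return (1.0 - p_unsafe) ** n_rollouts


if __name__ == "__main__":
    # Hypothetical example: an unsafe trajectory with 1% probability.
    for n in (1, 10, 100):
        print(f"rollouts={n:>3}  miss probability={miss_probability(0.01, n):.3f}")
```

Under these assumptions, even 100 rollouts miss a 1%-probability unsafe trajectory more than a third of the time, and greedy decoding never visits it unless it happens to be the most likely path.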

Abstract

LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety…
