ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

TL;DR
This paper introduces ST-WebAgentBench, a comprehensive benchmark suite for evaluating the safety and trustworthiness of web agents in enterprise scenarios, addressing gaps in existing task success metrics.
Contribution
It presents a configurable, extensible benchmark with new metrics and policies to assess safety and trustworthiness of web agents beyond mere task completion.
Findings
State-of-the-art agents often breach safety policies
CuP metric effectively measures policy-compliant completions
Benchmark exposes safety gaps in current web agents
Abstract
Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating…
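The two metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract defines CuP only as crediting completions that respect all applicable policies, and it gives no formula for the Risk Ratio. The `Episode` record, the function names, and the Risk Ratio normalization used here (violations in a dimension divided by the number of policy instances in that dimension) are therefore assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One agent run on one task (hypothetical record layout)."""
    completed: bool
    # Dimension name -> number of policy violations observed in that dimension.
    violations: dict[str, int] = field(default_factory=dict)


def completion_rate(episodes: list[Episode]) -> float:
    """Raw task success: fraction of episodes where the task was completed."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.completed for e in episodes) / len(episodes)


def cup(episodes: list[Episode]) -> float:
    """Completion Under Policy: a completion counts only with zero violations."""
    if not episodes:
        raise ValueError("need at least one episode")
    clean = sum(e.completed and not any(e.violations.values()) for e in episodes)
    return clean / len(episodes)


def risk_ratio(episodes: list[Episode],
               policies_per_dimension: dict[str, int]) -> dict[str, float]:
    """Assumed definition: violations per dimension / policy instances in it."""
    totals: dict[str, int] = {}
    for e in episodes:
        for dim, n in e.violations.items():
            totals[dim] = totals.get(dim, 0) + n
    return {
        dim: totals.get(dim, 0) / count
        for dim, count in policies_per_dimension.items()
        if count  # skip dimensions with no policies to avoid dividing by zero
    }
```

With four episodes, two of which complete the task and one of those breaching a user-consent policy, the completion rate is 0.5 while CuP drops to 0.25. That gap between the two numbers is what the benchmark is designed to expose.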
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper addresses a genuine gap in web agent evaluation. Current benchmarks measure only task success, ignoring whether agents complete tasks safely or within policy constraints. This matters for real-world deployment where unsafe successes can cause serious harm. 2. The hierarchical policy framework (organizational > user > task) is well-motivated and reflects real enterprise governance structures. The formalization provides a principled approach to reasoning about conflicting constraints.
1. Only 222 tasks across three applications (GitLab, ShoppingAdmin, SuiteCRM) in English cannot support claims about "enterprise readiness" or comprehensive safety evaluation. Where are tasks for email, document collaboration, financial systems, HR platforms, or communication tools? The authors position this as "the first benchmark" and standard for enterprise deployment, but it covers only a narrow slice of enterprise workflows. The generalization claims far exceed what this limited set can jus
1. The paper addresses a critical gap in web agent evaluation. Current benchmarks only measure whether agents finish tasks, completely ignoring the safety aspect. The paper makes a convincing case that this is a serious problem for actual enterprise deployment. 2. The scalability study (policy-load vs. CuP) convincingly demonstrates that current agents do not gracefully handle even modest policy stacks, highlighting a concrete research bottleneck. 3. Table 1 gives a clear, simple comparison of the b
1. The evaluation is limited to three SOTA open-source agents; there is no comparison with the latest closed-source model baselines (e.g., GPT, Claude, Gemini), so the behavior of frontier models under policy prompting remains untested. 2. Only 3 applications and 222 tasks may not capture the full heterogeneity of enterprise web workflows (e.g., finance, health, legacy systems). The authors acknowledge this in the paper but don't really discuss the implications. 3. The current “POLICY_CONTEXT” s
The paper makes a strong contribution by introducing a benchmark that evaluates web agents not only on task completion but also on safety and trustworthiness. I appreciate the clear definition of six policy dimensions and the formal hierarchy that governs them, which provides a structured way to reason about permissible actions. The metrics such as Completion-under-Policy and Risk Ratio are well-motivated and transform qualitative policy adherence into quantitative measures. Integrating these ch
I have several concerns. The benchmark’s 222 tasks, while enterprise-relevant, are restricted to three applications and the English language, potentially limiting generalizability to broader workflows and multilingual settings. Another concern is the skew in policy distribution, which may bias results toward certain violation types while underrepresenting others like error handling. The reliance on prompt-level policy injection could conflate adherence with prompt compliance, missing violations that
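The policy hierarchy the reviews mention (organizational > user > task) can be illustrated with a short sketch of precedence-based conflict resolution. The reviews state only the ordering; the `Policy` record, the `is_allowed` function, the action names, and the rule that ties at the same level resolve to "deny" are all assumptions made for this example, not the paper's formalization.

```python
from dataclasses import dataclass

# Lower number = higher precedence, following organizational > user > task.
PRECEDENCE = {"organizational": 0, "user": 1, "task": 2}


@dataclass(frozen=True)
class Policy:
    """A single allow/deny rule for one action (hypothetical layout)."""
    level: str      # one of the PRECEDENCE keys
    action: str     # illustrative action name, e.g. "delete_project"
    allowed: bool


def is_allowed(action: str, policies: list[Policy], default: bool = True) -> bool:
    """Resolve conflicting policies by letting the highest level decide."""
    relevant = [p for p in policies if p.action == action]
    if not relevant:
        return default  # no policy mentions this action
    top = min(PRECEDENCE[p.level] for p in relevant)
    # Assumed tie-break: if policies at the deciding level disagree, deny.
    return all(p.allowed for p in relevant if PRECEDENCE[p.level] == top)
```

For example, a task-level rule that permits `delete_project` is overridden by an organizational rule that forbids it, so `is_allowed("delete_project", ...)` returns `False` when both are present.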
Taxonomy
Topics: Access Control and Trust · Network Security and Intrusion Detection · Advanced Malware Detection Techniques
