WebGuard: Building a Generalizable Guardrail for Web Agents
Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su

TL;DR
WebGuard introduces a comprehensive dataset and evaluation framework for assessing and improving the safety of web agents by predicting the outcomes of their actions across diverse online environments.
Contribution
It presents the first dataset for web agent safety assessment, introduces a novel three-tier risk schema, and demonstrates that fine-tuning significantly improves action outcome prediction and risk detection.
Findings
Fine-tuning raises risk prediction accuracy from 37% to 80%.
Fine-tuning raises recall on HIGH-risk actions from 20% to 76%.
Current models still need better reliability for high-stakes deployment.
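The two headline numbers above are an overall accuracy and a per-class recall on the HIGH tier. As a minimal sketch, assuming the standard definitions of both metrics (this summary does not give the paper's exact evaluation code), they could be computed as follows. The function names and the toy labels are illustrative only and are not taken from the dataset.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of actions whose predicted risk tier equals the gold tier."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def recall_for(label: str, gold: Sequence[str], pred: Sequence[str]) -> float:
    """Of the actions whose gold tier is `label`, the fraction predicted as `label`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be of equal length")
    preds_on_label = [p for g, p in zip(gold, pred) if g == label]
    if not preds_on_label:
        raise ValueError(f"no gold examples with label {label!r}")
    return sum(p == label for p in preds_on_label) / len(preds_on_label)


# Toy labels, invented for illustration (not WebGuard data).
gold = ["SAFE", "LOW", "HIGH", "HIGH", "SAFE"]
pred = ["SAFE", "SAFE", "HIGH", "LOW", "SAFE"]

overall_accuracy = accuracy(gold, pred)         # 3 of 5 correct
high_recall = recall_for("HIGH", gold, pred)    # 1 of 2 HIGH actions caught
```

Reporting HIGH recall separately matters because a model can reach a high overall accuracy while still missing most of the rare, high-stakes actions.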
Abstract
The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training…
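A minimal sketch of how the three-tier schema might be used as a guardrail is shown below. The tier names come from the abstract; everything else is an assumption for illustration: the ordering SAFE < LOW < HIGH, the `gate` function, the `threshold` parameter, and the keyword-based `toy_predictor`, which merely stands in for a prompted or fine-tuned model.

```python
from enum import IntEnum
from typing import Callable


class Risk(IntEnum):
    """Three-tier schema from the abstract; the numeric ordering is an assumption."""
    SAFE = 0
    LOW = 1
    HIGH = 2


def gate(
    action: str,
    predict_risk: Callable[[str], Risk],
    threshold: Risk = Risk.HIGH,
) -> bool:
    """Return True if the action may run without user confirmation.

    Any action whose predicted tier reaches `threshold` is held back.
    """
    return predict_risk(action) < threshold


def toy_predictor(action: str) -> Risk:
    """Hypothetical keyword rules, used only so the sketch runs end to end."""
    text = action.lower()
    if "purchase" in text or "delete" in text:
        return Risk.HIGH
    if "subscribe" in text:
        return Risk.LOW
    return Risk.SAFE


allowed_next = gate("click 'Next page'", toy_predictor)
allowed_buy = gate("click 'Confirm purchase'", toy_predictor)
```

Lowering `threshold` to `Risk.LOW` would make the gate stricter, trading more user confirmations for fewer unreviewed state changes.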
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. **Targeted an Important and Under-Studied Problem:** The paper correctly identifies the gap in agent safety. It shifts the focus from task-level policy compliance (like in `ST-WebAgentBench`) or malicious-intent red-teaming (like in `BrowserART`) to the more granular, inherent risk of *atomic actions*. This is a timely and promising research direction.
2. **Valuable Dataset Artifact:** The `WebGuard` dataset is a substantial contribution. With 4,939 annotated actions across 193 diverse websit
1. **Contribution is Heavily Engineering-Focused:** The paper's core contribution is a dataset, and it lacks the methodological novelty typically expected at ICLR.
   - The conceptual framework of an "action-impact" dataset is not new; "Interaction to Impact" recently established this paradigm for mobile UIs. This work is a valuable *port* to the web domain but not a new research *concept*.
   - The proposed SAFE/LOW/HIGH risk schema is a standard, pragmatic simplification used in operational
1. The paper introduces what the authors claim is the first comprehensive action-level dataset for web agent guardrails, which uniquely includes 4,939 annotations from real-world sources and covers even the often-overlooked long-tail websites.
2. The three-stage data curation process (website selection, action collection, and a multi-reviewer annotation review) indicates that the dataset is likely to be high-quality and reliable.
3. The paper's experimental setup is comprehensive and well-design
1. The long-tail split has only 143 actions from 15 websites. In my opinion, this split is too small for a reliable evaluation of generalization to underrepresented sites.
2. No inter-annotator agreement metrics were reported, raising questions about annotation reliability and schema clarity. How were disagreements resolved?
3. The current method of manually labelling data works for now, but it will be a bottleneck for creating the much larger datasets required in the future. Can aut
- The proposed dataset was collected from a diverse range of websites, and through cross-verified annotation, the authors evaluated each action not simply as risky or not risky but according to three levels of risk.
- In the experimental setup, rather than using a simple train/validation split, the evaluation is carefully designed to assess how robustly the fine-tuned model performs on out-of-distribution (OOD) data, such as long-tail and cross-domain scenarios.
- The authors made efforts to i
- This dataset needs to be collected manually by humans, as it relies heavily on human decision-making.
- Although the authors attempted to reduce variance through cross-checking among multiple human experts, the "intent" and "outcome" of web actions cannot be automatically determined, and assessing whether an action is reversible requires subjective, experience-based judgment, which can lead to inconsistent labeling across individuals.
- In reality, the fine-tuned VLM may not have developed
1. Tackles a timely, important, and practical problem at the action level. Reliable guardrails and an understanding of the impact of actions are a real bottleneck for safely deploying web agents in the wild.
2. The WebGuard dataset is a contribution. Its value comes from its scale, use of real-world websites (not simulations), domain diversity, inclusion of long-tail sites, and a well-defined annotation schema.
3. The use of four distinct generalization test splits (Long-Tail, Cross-Domain, C
1. The paper's core contribution is a dataset and benchmark. The "guardrail construction" uses standard prompting and supervised fine-tuning with no big methodological innovation.
2. I see two concerns with the dataset methodology. First, there is no reported inter-annotator agreement (IAA): The paper reports no IAA metrics (e.g., Fleiss' Kappa), which is a standard requirement for verifying the robustness and lack of ambiguity in a new annotation schema. Second, "Default SAFE" Labeling: The str
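For reference, the Fleiss' Kappa statistic that the reviewers ask for can be sketched as below. This assumes every action is labeled by the same number of annotators, which the excerpt does not confirm for WebGuard; the function name and the toy ratings are illustrative and are not drawn from the dataset.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of labels.

    `ratings[i]` holds the labels that the annotators gave to item i.
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / len(ratings)
    total = len(ratings) * n_raters
    # Agreement expected by chance, from the overall label proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Toy ratings from three hypothetical annotators (not WebGuard data).
toy = [["SAFE", "SAFE", "LOW"], ["HIGH", "HIGH", "HIGH"]]
kappa = fleiss_kappa(toy)
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates that annotators agree about as often as random labeling with the same label frequencies would.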
