WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen, Yuexin Li, Yufei He, Yue Liu, Shuicheng Yan, Bryan Hooi

TL;DR
This paper introduces WARD, a robust guard model for web agents that defends against prompt injection attacks, using large datasets and adaptive adversarial training to ensure high accuracy and efficiency.
Contribution
The paper presents WARD, a novel guard model built on extensive datasets and an adaptive training framework, improving robustness and efficiency against evolving prompt injection attacks.
Findings
WARD achieves nearly perfect recall on out-of-distribution benchmarks.
Maintains low false positive rates to preserve agent utility.
Remains robust against adaptive attacks under distribution shifts.
Abstract
Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
