StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability
Haoyue Bai, Dong Wang, Long Chen, Bingguang Hao, Pengyang Shao, Yonghui Yang, Yicheng He, Chenyi Zhuang

TL;DR
StressWeb is a benchmark designed to evaluate web agents' robustness by introducing realistic interaction perturbations, revealing failure modes not seen in ideal conditions.
Contribution
We created a stress-testing benchmark with controlled perturbations to systematically diagnose web agent robustness under realistic variability.
Findings
StressWeb exposes robustness gaps in state-of-the-art web agents.
Perturbations cause significant performance drops in existing agents.
The benchmark enables systematic diagnosis of failure modes under interaction variability.
Abstract
Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which may overestimate agent robustness. High task success in such idealized settings does not necessarily reflect performance under realistic web interaction. To address this limitation, we introduce a diagnostic stress-testing benchmark for web agents. We first construct realistic and controllable web environments that provide clean and stable interaction workflows as reference baselines. We then introduce structured and controlled perturbations that emulate interaction variability, including shifting layouts, altered interaction semantics, and execution disruptions. By comparing agent behavior between clean and perturbed settings, our framework enables…
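The clean-versus-perturbed comparison described in the abstract can be illustrated with a minimal sketch. It assumes robustness is summarized as the drop in task success rate from the clean baseline to each perturbed setting; the abstract does not state the paper's actual metrics, so this metric, the function names, the perturbation labels, and the outcome data are all hypothetical and used for illustration only.

```python
# Illustrative sketch only: the metric, names, and data below are assumptions,
# not the StressWeb implementation.

def success_rate(outcomes):
    """Return the fraction of successful episodes in a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def robustness_gaps(clean, perturbed):
    """Compute the success-rate drop for each perturbation type.

    clean:     list of bools, one per episode in the unperturbed environment.
    perturbed: dict mapping a perturbation label to a list of bools.
    Returns a dict mapping each label to (clean rate - perturbed rate).
    """
    baseline = success_rate(clean)
    return {name: baseline - success_rate(runs) for name, runs in perturbed.items()}


# Hypothetical episode outcomes (True = task completed).
clean_runs = [True, True, True, False]
perturbed_runs = {
    "layout_shift": [True, False, True, False],
    "altered_semantics": [True, False, False, False],
    "execution_disruption": [False, False, True, False],
}

gaps = robustness_gaps(clean_runs, perturbed_runs)
for name, gap in gaps.items():
    print(f"{name}: success rate drops by {gap:.2f}")
```

A larger gap for one perturbation type than another would point to the kind of variability the agent handles worst, which is the sort of diagnosis the clean-versus-perturbed comparison is meant to support.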
