WebSuite: Systematically Evaluating Why Web Agents Fail
Eric Li, Jim Waldo

TL;DR
WebSuite is a diagnostic benchmark that systematically evaluates why web agents fail by breaking down their actions, enabling targeted improvements and revealing specific weaknesses in different agent types.
Contribution
The paper introduces WebSuite, the first extensible benchmark with a taxonomy of web actions to diagnose failure points in generalist web agents.
Findings
Identified distinct failure patterns in web agents.
Demonstrated the utility of WebSuite in pinpointing weaknesses.
Provided insights for targeted agent improvements.
Abstract
We describe WebSuite, the first diagnostic benchmark for generalist web agents, designed to systematically evaluate why agents fail. Advances in AI have led to the rise of numerous web agents that autonomously operate a browser to complete tasks. However, most existing benchmarks focus on strictly measuring whether an agent can or cannot complete a task, without giving insight on why. In this paper, we 1) develop a taxonomy of web actions to facilitate identifying common failure patterns, and 2) create an extensible benchmark suite to assess agents' performance on our taxonomized actions. This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart, and is designed such that any failure of a task can be attributed directly to a failure of a specific web action. We evaluate two popular generalist web agents, one…
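To make the attribution idea concrete, here is a minimal sketch of how per-action failure rates could be tallied when each individual task maps to exactly one web action. It is not the paper's code or its taxonomy: the `TaskResult` record, the `failure_rates` helper, the task IDs, and the action labels ("click", "type", "select") are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one individual task (hypothetical record layout)."""
    task_id: str
    action: str   # assumed taxonomy label, e.g. "click"
    passed: bool


def failure_rates(results):
    """Return the fraction of failed tasks for each action label."""
    totals = Counter(r.action for r in results)
    failures = Counter(r.action for r in results if not r.passed)
    # Counter returns 0 for labels with no failures.
    return {action: failures[action] / totals[action] for action in totals}


# Made-up outcomes, purely to exercise the helper.
results = [
    TaskResult("click-button-1", "click", True),
    TaskResult("click-button-2", "click", False),
    TaskResult("type-field-1", "type", True),
    TaskResult("select-option-1", "select", False),
]

rates = failure_rates(results)
for action, rate in sorted(rates.items()):
    print(f"{action}: {rate:.0%} failed")
```

Because every task carries a single action label, a high failure rate for one label points at that action rather than at the task as a whole, which is the kind of diagnosis the abstract describes.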
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Spam and Phishing Detection
Methods: Focus
