WebSuite: Systematically Evaluating Why Web Agents Fail

Eric Li; Jim Waldo

arXiv:2406.01623 · cs.SE · June 5, 2024 · 1 citation

TL;DR

WebSuite is a diagnostic benchmark that systematically evaluates why web agents fail by breaking down their actions, enabling targeted improvements and revealing specific weaknesses in different agent types.

Contribution

The paper introduces WebSuite, the first extensible benchmark with a taxonomy of web actions to diagnose failure points in generalist web agents.

Findings

01. Identified distinct failure patterns in web agents.

02. Demonstrated the utility of WebSuite in pinpointing weaknesses.

03. Provided insights for targeted agent improvements.

Abstract

We describe WebSuite, the first diagnostic benchmark for generalist web agents, designed to systematically evaluate why agents fail. Advances in AI have led to the rise of numerous web agents that autonomously operate a browser to complete tasks. However, most existing benchmarks focus on strictly measuring whether an agent can or cannot complete a task, without giving insight on why. In this paper, we 1) develop a taxonomy of web actions to facilitate identifying common failure patterns, and 2) create an extensible benchmark suite to assess agents' performance on our taxonomized actions. This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart, and is designed such that any failure of a task can be attributed directly to a failure of a specific web action. We evaluate two popular generalist web agents, one…
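The abstract's design goal, that every task failure can be traced to one specific web action, can be illustrated with a minimal sketch. Everything below is hypothetical: the `StepResult` record, the action labels (`click`, `type`), the task names, and the sample data are assumptions made for illustration, not the paper's taxonomy, code, or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepResult:
    task: str     # hypothetical task name, e.g. "add_to_cart"
    action: str   # hypothetical action label, e.g. "click"
    passed: bool  # whether the agent performed this step correctly


def failure_rates(results):
    """Return {action: fraction of failed steps} over all recorded steps."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        totals[r.action] += 1
        if not r.passed:
            failures[r.action] += 1
    return {action: failures[action] / totals[action] for action in totals}


def first_failure(results, task):
    """Return the action label of the first failed step in `task`, or None."""
    for r in results:
        if r.task == task and not r.passed:
            return r.action
    return None


# Made-up data for illustration only; not results from the paper.
results = [
    StepResult("click_button", "click", True),
    StepResult("add_to_cart", "type", True),
    StepResult("add_to_cart", "click", False),
    StepResult("fill_form", "type", False),
]
rates = failure_rates(results)
```

Under these assumptions, `failure_rates` summarizes which action categories an agent struggles with, while `first_failure` attributes a failed end-to-end task to the first action that went wrong.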

Code & Models

Repositories

erichli1/websuite
none · Official

Taxonomy

Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Spam and Phishing Detection

Methods: Focus