AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran; Samuel Joseph Amouyal; Chaitanya Malaviya; Ben Bogin; Ofir Press; Jonathan Berant

arXiv:2407.15711 · cs.CL · October 22, 2024 · 1 citation

TL;DR

This paper introduces AssistantBench, a new benchmark for evaluating web agents on realistic, time-consuming tasks, revealing current limitations and proposing a new agent, SeePlanAct, that outperforms previous models.

Contribution

The paper presents AssistantBench, a benchmark of 214 realistic, automatically evaluable web tasks, and introduces SeePlanAct (SPA), a new web agent reported to outperform previous agents on these tasks.

Findings

01

No evaluated model exceeds an accuracy of 26 points on AssistantBench.

02

Closed-book LMs have low precision and hallucinate facts.

03

State-of-the-art web agents score near zero on the benchmark.

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 26 points. While closed-book LMs perform well in terms of accuracy, they exhibit low precision and tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a…
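The gap between accuracy and precision noted in the abstract can be illustrated with a minimal sketch. It assumes that accuracy averages per-task scores over all tasks (abstentions count as zero) and that precision averages only over tasks the system actually answered; these definitions, the function name `accuracy_and_precision`, and the toy scores are assumptions for illustration, not taken from the abstract.

```python
from typing import Optional, Sequence, Tuple


def accuracy_and_precision(
    scores: Sequence[Optional[float]],
) -> Tuple[float, float, float]:
    """Return (accuracy, answer_rate, precision) for per-task scores.

    scores[i] is a score in [0, 1] for task i, or None if the system
    abstained. Assumed definitions (hypothetical, for illustration):
      accuracy    = sum of scores / number of all tasks
      answer_rate = number of answered tasks / number of all tasks
      precision   = sum of scores / number of answered tasks
    """
    if not scores:
        raise ValueError("scores must not be empty")
    answered = [s for s in scores if s is not None]
    total = sum(answered)
    accuracy = total / len(scores)
    answer_rate = len(answered) / len(scores)
    # With no answers at all, precision is undefined; report 0.0 here.
    precision = total / len(answered) if answered else 0.0
    return accuracy, answer_rate, precision


# A system that always answers (and is often wrong) ...
always_answers = accuracy_and_precision([1.0, 0.0, 0.0, 0.0])
# ... versus one that abstains when unsure.
abstains = accuracy_and_precision([1.0, 0.0, None, None])
print(always_answers)
print(abstains)
```

Under these assumed definitions, both toy systems have the same accuracy, but the one that answers every task has lower precision, which is the pattern the abstract attributes to closed-book LMs that hallucinate instead of abstaining.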

Code & Models

Datasets

AssistantBench/AssistantBench
dataset · 11k downloads

Videos

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Peer-to-Peer Network Technologies