AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

TL;DR
This paper introduces AssistantBench, a new benchmark for evaluating web agents on realistic, time-consuming tasks, revealing current limitations and proposing a new agent, SeePlanAct, that outperforms previous models.
Contribution
The paper presents AssistantBench, a comprehensive benchmark for web agents, and introduces SeePlanAct, a novel web agent that significantly improves performance on complex tasks.
Findings
No current model exceeds an accuracy of 26 points on AssistantBench.
Closed-book LMs have low precision and hallucinate facts.
State-of-the-art web agents perform near zero on the benchmark.
Abstract
Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 26 points. While closed-book LMs perform well in terms of accuracy, they exhibit low precision and tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a…
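The abstract contrasts accuracy with precision without defining either. The sketch below shows one common reading, which is an assumption here rather than something stated in the text above: accuracy averages the per-task score over all tasks and counts an abstention as zero, while precision averages only over the tasks the system actually answered. The function names and the example scores are hypothetical and are not results from the paper.

```python
from typing import Optional, Sequence

# Each entry is a per-task score in [0, 1], or None when the system abstained.
# This representation is an assumption made for illustration.
Scores = Sequence[Optional[float]]


def accuracy(scores: Scores) -> float:
    """Mean score over ALL tasks; an abstention counts as 0."""
    if not scores:
        return 0.0
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)


def precision(scores: Scores) -> float:
    """Mean score over ANSWERED tasks only; abstentions are ignored."""
    answered = [s for s in scores if s is not None]
    if not answered:
        return 0.0
    return sum(answered) / len(answered)


def answer_rate(scores: Scores) -> float:
    """Fraction of tasks for which the system produced any answer."""
    if not scores:
        return 0.0
    return sum(s is not None for s in scores) / len(scores)


# Hypothetical scores for five tasks: two abstentions, one full miss.
example = [1.0, 0.0, None, 0.5, None]
example_accuracy = accuracy(example)
example_precision = precision(example)
example_answer_rate = answer_rate(example)
```

Under this reading, a system that always answers has identical accuracy and precision, so a closed-book LM that rarely abstains pays for every hallucinated answer in both metrics, whereas a system that abstains when unsure can have high precision despite low accuracy.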
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Peer-to-Peer Network Technologies
