Tur[k]ingBench: A Challenge Benchmark for Web Agents

Kevin Xu; Yeganeh Kordi; Tanay Nayak; Adi Asija; Yizhong Wang; Kate Sanders; Adam Byerly; Jingyu Zhang; Benjamin Van Durme; Daniel Khashabi

arXiv:2403.11905 · cs.AI · February 25, 2025 · 1 citation

TL;DR

TurkingBench is a comprehensive benchmark with diverse web-based tasks designed to evaluate multi-modal models' ability to perform complex annotation tasks on natural HTML pages, aiming to advance web agent development.

Contribution

This paper introduces TurkingBench, a novel benchmark with real HTML tasks and a framework linking chatbot responses to web actions, enabling evaluation of multi-modal web agents.

Findings

1. Models outperform random chance but still have significant room for improvement.

2. The benchmark includes 32.2K instances across 158 tasks, providing a diverse evaluation platform.

3. Evaluation of models such as GPT-4 and InternVL demonstrates current capabilities and gaps.

Abstract

Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g.,…
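
As a rough illustration of the instantiation step described in the abstract, the sketch below fills one task template with per-instance values to produce several concrete pages. The template text, the field name `sentence`, the `${...}` placeholder syntax, and the `instantiate` helper are all assumptions made for illustration; none of them are taken from the TurkingBench code.

```python
import html
from string import Template

# Hypothetical task page: one instruction, one input field, one answer widget.
# The ${sentence} placeholder marks where per-instance values are filled in.
TASK_TEMPLATE = Template(
    "<p>Instructions: rate the sentiment of the sentence below.</p>\n"
    '<p id="sentence">${sentence}</p>\n'
    '<select name="label">'
    "<option>positive</option><option>negative</option>"
    "</select>"
)


def instantiate(template, rows):
    """Return one HTML page per row, with values escaped for safe display."""
    pages = []
    for row in rows:
        escaped = {key: html.escape(value) for key, value in row.items()}
        pages.append(template.substitute(escaped))
    return pages


# Two illustrative rows yield two instances of the same task.
rows = [
    {"sentence": "The movie was great."},
    {"sentence": "The plot dragged."},
]
pages = instantiate(TASK_TEMPLATE, rows)
```

Under this reading, a task corresponds to one template and an instance to one filled-in page, so 158 tasks with many rows each would account for the 32.2K instances reported.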

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Peer-to-Peer Network Technologies