Tur[k]ingBench: A Challenge Benchmark for Web Agents
Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi

TL;DR
TurkingBench is a comprehensive benchmark with diverse web-based tasks designed to evaluate multi-modal models' ability to perform complex annotation tasks on natural HTML pages, aiming to advance web agent development.
Contribution
This paper introduces TurkingBench, a novel benchmark with real HTML tasks and a framework linking chatbot responses to web actions, enabling evaluation of multi-modal web agents.
Findings
Models outperform random chance but still have significant room for improvement.
The benchmark includes 32.2K instances across 158 tasks, providing a diverse evaluation platform.
Evaluation of models like GPT-4 and InternVL demonstrates current capabilities and gaps.
Abstract
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g.,…
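The abstract's point that each task's HTML is instantiated with different values can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the template text, the `${sentence}` placeholder, the `ROWS` data, and the `instantiate` helper are hypothetical and are not taken from the benchmark's code or data. The sketch uses Python's standard `string.Template` only because its `${name}` placeholder syntax is convenient for the example.

```python
from html import escape
from string import Template

# Hypothetical task template: one HTML page with a placeholder for the input.
# Real crowdsourcing templates are full pages; this is a toy stand-in.
TASK_TEMPLATE = Template(
    "<p>Does the sentence express a positive sentiment?</p>\n"
    "<blockquote>${sentence}</blockquote>\n"
    '<label><input type="radio" name="answer" value="yes"> Yes</label>\n'
    '<label><input type="radio" name="answer" value="no"> No</label>\n'
)

# Hypothetical input rows; each row yields one instance of the task.
ROWS = [
    {"sentence": "The food was great."},
    {"sentence": "Service was slow & rude."},
]


def instantiate(template: Template, rows: list[dict[str, str]]) -> list[str]:
    """Fill the template once per row, HTML-escaping every value."""
    pages = []
    for row in rows:
        safe_row = {key: escape(value) for key, value in row.items()}
        pages.append(template.substitute(safe_row))
    return pages


instances = instantiate(TASK_TEMPLATE, ROWS)
```

Under these assumptions, one task template plus N input rows yields N instances, which is how a modest number of tasks can expand into a much larger number of instances.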
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Peer-to-Peer Network Technologies
