VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim,, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

TL;DR
VisualWebArena is a new benchmark for evaluating multimodal web agents on realistic visual tasks, highlighting current limitations and guiding future improvements in multimodal autonomous web agents.
Contribution
Introduces VisualWebArena, a benchmark for assessing multimodal web agents on complex visual tasks, filling a gap in existing text-only web automation benchmarks.
Findings
Text-only agents have significant limitations on visual tasks.
Multimodal models show improved performance but still have notable gaps.
Benchmark reveals specific challenges in processing image-text inputs and executing web actions.
Abstract
Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
MethodsSparse Evolutionary Training · Focus
