VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh; Robert Lo; Lawrence Jang; Vikram Duvvur; Ming Chong Lim; Po-Yu Huang; Graham Neubig; Shuyan Zhou; Ruslan Salakhutdinov; Daniel Fried

arXiv:2401.13649 · cs.LG · June 7, 2024 · 1 cites

Open Access · 1 Repo · 1 Video

TL;DR

VisualWebArena is a new benchmark for evaluating multimodal web agents on realistic visual tasks, highlighting current limitations and guiding future improvements in multimodal autonomous web agents.

Contribution

Introduces VisualWebArena, a benchmark for assessing multimodal web agents on complex visual tasks, filling a gap left by existing web automation benchmarks, which focus mainly on text-based agents.

Findings

01. Text-only agents have significant limitations on visual tasks.

02. Multimodal models show improved performance but still have notable gaps.

03. Benchmark reveals specific challenges in processing image-text inputs and executing web actions.

Abstract

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language…

Code & Models

Repositories

web-arena-x/visualwebarena (Official)

Videos

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Focus