VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang; Yinheng Li; Dan Zhao; Charles Ding; Justin Lin; Paul Pu Liang; Rogerio Bonatti; Kazuhito Koishida

arXiv:2410.19100 · cs.CV · February 18, 2025

TL;DR

VideoWebArena is a benchmark for evaluating long-context video understanding in multimodal agents. It exposes the gap between current models and human performance on skill retention and factual retention tasks.

Contribution

The paper presents VideoWebArena, a new benchmark with 2,021 tasks for assessing long-context video understanding in multimodal agents, emphasizing skill and factual retention.

Findings

01 Best model achieves 13.3% success on factual retention tasks

02 Models perform worse with tutorials than without, showing a 5-10.3% decrease

03 Human performance exceeds models with 73.9-79.3% success

Abstract

Videos are often used to learn or extract the necessary information to complete tasks in ways different than what text and static imagery alone can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, the factual retention task evaluates whether an…

Code & Models

Repositories

ljang0/videowebarena

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques