VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

TL;DR
VideoWebArena is a benchmark of 2,021 web agent tasks for evaluating long-context video understanding in multimodal agents. Results on its skill retention and factual retention tasks show that current models fall well short of human performance.
Contribution
The paper presents VideoWebArena, a new benchmark with 2,021 tasks for assessing long-context video understanding in multimodal agents, emphasizing skill and factual retention.
Findings
The best model achieves only 13.3% success on factual retention tasks
Models perform worse with tutorials than without: a 5% decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks
Humans far outperform models, reaching 73.9% to 79.3% success
Abstract
Videos are often used to learn or extract the necessary information to complete tasks in ways different than what text and static imagery alone can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, the factual retention task evaluates whether an…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
