VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Xiaohan Wang; Yuhui Zhang; Orr Zohar; Serena Yeung-Levy

arXiv:2403.10517 · cs.CV · March 18, 2024 · 1 citation

TL;DR

VideoAgent is an agent-based framework that uses a large language model together with vision-language tools to improve long-form video understanding through interactive reasoning, achieving state-of-the-art zero-shot accuracy on the EgoSchema and NExT-QA benchmarks.

Contribution

The paper presents a novel agent-based system in which a large language model iteratively reasons and plans over a long-form video, using vision-language foundation models as tools to translate and retrieve only the visual information it needs.

Findings

01. Achieves 54.1% zero-shot accuracy on EgoSchema

02. Achieves 71.3% zero-shot accuracy on NExT-QA

03. Uses only 8.4 (EgoSchema) and 8.2 (NExT-QA) frames on average for inference

Abstract

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art…
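
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how frames are initially chosen, when the agent stops, or how retrieval is performed, so the uniform initial sampling, the boolean confidence signal, the round limit, and every name here (`run_agent`, `caption_frame`, `answer_with_confidence`, `retrieve_frames`) are assumptions. The three callables stand in for the vision-language tools and the language model, and the toy versions at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple


def run_agent(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],
    answer_with_confidence: Callable[[str, Dict[int, str]], Tuple[str, bool]],
    retrieve_frames: Callable[[str, Dict[int, str]], List[int]],
    initial_samples: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Hypothetical agent loop: returns (answer, number of frames used)."""
    # Assumption: start from a few uniformly spaced frames.
    step = max(1, num_frames // initial_samples)
    captions: Dict[int, str] = {}
    for idx in list(range(0, num_frames, step))[:initial_samples]:
        # A vision-language tool "translates" the frame into text.
        captions[idx] = caption_frame(idx)

    answer = ""
    for round_idx in range(max_rounds + 1):
        # The language model answers from the evidence gathered so far
        # and reports whether it considers that evidence sufficient.
        answer, confident = answer_with_confidence(question, captions)
        if confident or round_idx == max_rounds:
            break
        # Otherwise a retrieval tool proposes additional frames to inspect.
        new_frames = [
            i for i in retrieve_frames(question, captions)
            if 0 <= i < num_frames and i not in captions
        ]
        if not new_frames:
            break  # nothing new to look at
        for idx in new_frames:
            captions[idx] = caption_frame(idx)

    return answer, len(captions)


# Toy stand-ins (purely illustrative): the "right" evidence is in frame 47.
def toy_caption(idx: int) -> str:
    return f"description of frame {idx}"


def toy_answer(question: str, captions: Dict[int, str]) -> Tuple[str, bool]:
    return ("B", True) if 47 in captions else ("A", False)


def toy_retrieve(question: str, captions: Dict[int, str]) -> List[int]:
    return [47]
```

With these toy stand-ins, `run_agent("What happens?", 100, toy_caption, toy_answer, toy_retrieve)` looks at five uniformly spaced frames, fetches one more when the first answer is not confident, and then stops. The point of this design is that the number of frames inspected grows only when the language model asks for more evidence, which is consistent with the small average frame counts reported above.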

Taxonomy

Topics: Multimodal Machine Learning Applications