VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy

TL;DR
VideoAgent introduces an agent-based framework utilizing large language models and vision-language tools to improve long-form video understanding through interactive reasoning, achieving state-of-the-art zero-shot accuracy on challenging benchmarks.
Contribution
The paper presents a novel agent-based system that leverages large language models for iterative reasoning and planning in long-form video understanding, with efficient visual information retrieval.
Findings
Achieves 54.1% zero-shot accuracy on EgoSchema
Achieves 71.3% zero-shot accuracy on NExT-QA
Uses only 8.4 and 8.2 frames on average per video on EgoSchema and NExT-QA, respectively
Abstract
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art…
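The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `run_agent`, `describe`, `decide`, and `retrieve`, the uniform initial sampling, and the round limit are all assumptions introduced here. `describe` stands in for a vision-language model that translates a frame into text, `retrieve` for a tool that finds frames matching a text query, and `decide` for the large language model that either answers or asks for more information. All three are replaced by toy stubs so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AgentState:
    """Everything the agent has gathered so far (hypothetical structure)."""
    question: str
    notes: Dict[int, str] = field(default_factory=dict)  # frame index -> description


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k roughly evenly spaced frame indices (an assumed starting strategy)."""
    if k <= 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return sorted({round(i * step) for i in range(k)})


def run_agent(
    question: str,
    num_frames: int,
    describe: Callable[[int], str],
    decide: Callable[[AgentState], Tuple[str, Optional[str]]],
    retrieve: Callable[[str, int], List[int]],
    initial_frames: int = 5,
    max_rounds: int = 3,
) -> Tuple[str, List[int]]:
    """Iteratively gather frame descriptions until the agent is ready to answer."""
    state = AgentState(question)
    for idx in uniform_sample(num_frames, initial_frames):
        state.notes[idx] = describe(idx)

    for _ in range(max_rounds):
        # decide returns (current best answer, follow-up query or None if satisfied)
        answer, query = decide(state)
        if query is None:
            return answer, sorted(state.notes)
        new = [i for i in retrieve(query, num_frames) if i not in state.notes]
        if not new:
            # Nothing new to look at, so stop with the current answer.
            return answer, sorted(state.notes)
        for idx in new:
            state.notes[idx] = describe(idx)

    # Round budget exhausted: answer with whatever has been gathered.
    answer, _ = decide(state)
    return answer, sorted(state.notes)


# Toy stubs standing in for real models (purely illustrative).
def toy_describe(idx: int) -> str:
    return f"description of frame {idx}"


def toy_decide(state: AgentState) -> Tuple[str, Optional[str]]:
    if 42 in state.notes:
        return "B", None
    return "A", "the moment the person picks up the cup"


def toy_retrieve(query: str, num_frames: int) -> List[int]:
    return [42]


answer, frames = run_agent(
    "What does the person do with the cup?", 101,
    toy_describe, toy_decide, toy_retrieve,
)
print(answer, frames, len(frames))
```

The design point the sketch tries to capture is that the number of frames examined is driven by what the agent asks for rather than by the video length, which is consistent with the small average frame counts reported above.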
Taxonomy
Topics: Multimodal Machine Learning Applications
