VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Yue Fan; Xiaojian Ma; Rujie Wu; Yuntao Du; Jiaqi Li; Zhi Gao; Qing Li

arXiv:2403.11481 · cs.CV · July 16, 2024 · 2 citations

TL;DR

VideoAgent introduces a memory-augmented multimodal framework that leverages foundation models and structured memory to improve long-term video understanding, especially for lengthy videos with complex temporal relations.

Contribution

The paper presents a unified memory mechanism and a multimodal agent architecture that improve long-term video understanding by combining several foundation models with memory modules.

Findings

01. Average increase of 6.6% over baselines on the NExT-QA benchmark.

02. Average increase of 26.0% over baselines on the EgoSchema benchmark.

03. Comes close to the performance of private models while using an open-source approach.

Abstract

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the…
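The two memory components and the two query tools named in the abstract can be pictured with a small sketch. This is not the authors' implementation: the class names, fields, and the word-overlap ranking below are hypothetical stand-ins chosen for illustration, and a real system would presumably rely on learned captioning, tracking, and retrieval models instead.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """One temporal-memory entry: a captioned time span (hypothetical schema)."""
    start: float  # seconds
    end: float    # seconds
    caption: str


@dataclass
class ObjectTrack:
    """One object-memory entry: a tracked object and the segments it appears in."""
    object_id: int
    category: str
    segment_ids: list[int]


@dataclass
class VideoMemory:
    """Toy structured memory holding event descriptions and object tracks."""
    segments: list[Segment] = field(default_factory=list)
    tracks: list[ObjectTrack] = field(default_factory=list)

    def localize_segments(self, query: str, top_k: int = 1) -> list[int]:
        """Stand-in for segment localization: rank by word overlap with the query."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(seg.caption.lower().split())), idx)
            for idx, seg in enumerate(self.segments)
        ]
        # Highest overlap first; ties go to the earliest segment.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [idx for score, idx in scored[:top_k] if score > 0]

    def query_objects(self, category: str) -> list[ObjectTrack]:
        """Stand-in for object memory querying: filter tracks by category."""
        return [t for t in self.tracks if t.category == category]


# Made-up example data, not taken from any benchmark.
memory = VideoMemory(
    segments=[
        Segment(0.0, 2.0, "a person opens the fridge"),
        Segment(2.0, 4.0, "a person pours milk into a cup"),
    ],
    tracks=[ObjectTrack(0, "cup", [1]), ObjectTrack(1, "person", [0, 1])],
)
hits = memory.localize_segments("who pours the milk")
cups = memory.query_objects("cup")
print(hits, [t.object_id for t in cups])
```

In an agent setting, an LLM would decide which of these tools to call for a given question and combine the returned segments and tracks into an answer; that orchestration is left out of the sketch.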

Taxonomy

Topics: Multimodal Machine Learning Applications