MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He; Hengduo Li; Young Kyun Jang; Menglin Jia; Xuefei Cao; Ashish Shah; Abhinav Shrivastava; Ser-Nam Lim

arXiv:2404.05726 · cs.CV · April 25, 2024 · 1 citation

TL;DR

This paper introduces MA-LMM, a memory-augmented multimodal model that enables long-term video understanding by storing and referencing historical video data, overcoming the limitations of existing models in processing extended video sequences.

Contribution

The paper presents a novel memory bank mechanism integrated into multimodal LLMs, allowing efficient long-term video analysis without exceeding context or memory limits.

Findings

1. Achieves state-of-the-art results on multiple long-video understanding datasets.
2. Effectively handles long-term video question answering and captioning tasks.
3. Demonstrates seamless integration of the memory bank into existing multimodal models.

Abstract

With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct…
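The online processing idea in the abstract can be illustrated with a minimal sketch. Frames arrive one at a time, each frame's feature vector is appended to a fixed-capacity memory bank, and the bank is shrunk whenever it overflows, so the amount of stored history stays bounded no matter how long the video runs. The abstract as quoted here does not say how the bank is kept within its limit, so the rule used below (averaging the two adjacent entries with the highest cosine similarity) is an assumption made for illustration. The names `MemoryBank`, `capacity`, and `add` are hypothetical and are not taken from the official repository.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame features."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.features = []  # one vector per stored (possibly merged) frame

    def add(self, feature):
        """Append the newest frame feature, then shrink back to capacity."""
        self.features.append(list(feature))
        while len(self.features) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self):
        # Assumed compression rule (not stated in the abstract): average the
        # adjacent pair with the highest cosine similarity, preserving order.
        sims = [
            cosine(self.features[i], self.features[i + 1])
            for i in range(len(self.features) - 1)
        ]
        i = sims.index(max(sims))
        merged = [(x + y) / 2 for x, y in zip(self.features[i], self.features[i + 1])]
        self.features[i:i + 2] = [merged]


# Online loop: the bank never holds more than `capacity` entries,
# regardless of how many frames have been seen.
bank = MemoryBank(capacity=3)
for frame_feature in ([1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.3, 0.8], [1.0, 1.0]):
    bank.add(frame_feature)
    # A real model would condition on bank.features at this point.
```

Because the bank size is constant, the cost of attending to history at each step does not grow with video length, which is the property the abstract attributes to the memory bank.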

Code & Models

Repositories

boheumd/MA-LMM (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Focus