AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

Yuanbin Man; Ying Huang; Chengming Zhang; Bingzhe Li; Wei Niu; Miao Yin

arXiv:2411.12593·cs.CV·April 7, 2025

TL;DR

AdaCM$^2$ introduces an adaptive cross-modality memory reduction technique for long-term video understanding, effectively aligning video and text data, improving performance, and significantly reducing memory usage across various tasks.

Contribution

AdaCM$^2$ is the first to incorporate adaptive cross-modality memory reduction for long-term video-text alignment in an auto-regressive framework.

Findings

01

Achieves state-of-the-art results on multiple video understanding tasks.

02

Improves performance by 4.5% on the LVU dataset.

03

Reduces GPU memory consumption by up to 65%.

Abstract

Advances in large language models (LLMs) have improved video understanding by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent works attempt to understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, which makes it difficult to handle complex question-answering tasks effectively. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on…
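The idea of reducing a visual memory by its correlation with the text query, applied step by step over a video stream, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`relevance_scores`, `reduce_memory`, `stream`), the use of scaled dot-product attention averaged over text tokens as the relevance score, and the fixed `budget` (the adaptive choice of how much to keep is not modelled).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_scores(visual_tokens, text_tokens):
    """Mean attention weight each visual token receives from the text tokens.

    Assumption: relevance is scaled dot-product attention, with text tokens
    as queries and visual tokens as keys.
    """
    dim = len(text_tokens[0])
    scores = [0.0] * len(visual_tokens)
    for q in text_tokens:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in visual_tokens
        ]
        for i, w in enumerate(softmax(logits)):
            scores[i] += w / len(text_tokens)
    return scores


def reduce_memory(visual_tokens, text_tokens, budget):
    """Keep the `budget` most text-relevant visual tokens, in temporal order."""
    if len(visual_tokens) <= budget:
        return list(visual_tokens)
    scores = relevance_scores(visual_tokens, text_tokens)
    top = sorted(
        range(len(visual_tokens)), key=lambda i: scores[i], reverse=True
    )[:budget]
    return [visual_tokens[i] for i in sorted(top)]


def stream(frames, text_tokens, budget):
    """Process frames one at a time, so memory never exceeds `budget` tokens."""
    memory = []
    for frame_tokens in frames:
        memory = reduce_memory(memory + list(frame_tokens), text_tokens, budget)
    return memory


# Toy example: one 2-d text token and three frames of two visual tokens each.
text = [[1.0, 0.0]]
frames = [
    [[0.0, 1.0], [2.0, 0.0]],
    [[3.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
memory = stream(frames, text, budget=2)
```

In this toy run the memory stays at two tokens however many frames arrive, and the tokens that survive are the ones most aligned with the text query. A visual-only merging scheme would have no way to prefer them, which is the limitation the abstract points to.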

Taxonomy

Topics: Advanced Data Compression Techniques · Image and Signal Denoising Methods · Advanced Image Processing Techniques