VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

Xiaohan Lan; Yitian Yuan; Zequn Jie; Lin Ma

arXiv:2410.11417 · cs.CV · October 16, 2024 · 2 cites

Open Access

TL;DR

VidCompress introduces a memory-enhanced temporal compression method for Video-LLMs, enabling better modeling of temporal relations and improved performance on video understanding tasks, especially for longer videos.

Contribution

It proposes a dual-compressor approach with memory mechanisms and multiscale transformers, advancing video comprehension in large language models.

Findings

01 Outperforms existing Video-LLMs on VideoQA datasets

02 Efficiently models complex temporal-spatial relations

03 Handles longer videos effectively

Abstract

Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks. However, most Video-LLMs treat videos as a sequential set of individual frames, which causes two problems: insufficient temporal-spatial interaction, which hinders fine-grained comprehension, and difficulty in processing longer videos due to limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross…
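To make the idea of memory-enhanced temporal compression more concrete, here is a toy sketch. It is not the paper's architecture: mean pooling stands in for the multiscale transformer, a fixed 50/50 blend stands in for learned attention over the memory cache, and the names `compress_with_memory`, `segment_len`, and `memory_size` are hypothetical. It only illustrates the general pattern of reducing many frame tokens to a few tokens while a bounded cache carries context from earlier segments.

```python
from collections import deque
from typing import List, Sequence

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / n for i in range(dim)]


def compress_with_memory(
    frames: Sequence[Vector],
    segment_len: int = 4,   # hypothetical: frames per short-term segment
    memory_size: int = 2,   # hypothetical: number of past summaries cached
) -> List[Vector]:
    """Return one compressed token per segment of frames.

    Each token blends the current segment's summary (short-term) with the
    mean of cached summaries from earlier segments (long-term).
    """
    memory: deque = deque(maxlen=memory_size)  # oldest summaries drop out
    outputs: List[Vector] = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        short_term = mean_pool(segment)
        if memory:
            long_term = mean_pool(list(memory))
            # Assumption: fixed equal weighting instead of learned attention.
            token = [0.5 * s + 0.5 * m for s, m in zip(short_term, long_term)]
        else:
            token = short_term
        outputs.append(token)
        memory.append(short_term)
    return outputs


# Eight 1-dimensional "frame features" compressed into two tokens.
frames = [[float(i)] for i in range(8)]
compressed = compress_with_memory(frames, segment_len=4, memory_size=2)
```

In this sketch the token count shrinks from the number of frames to the number of segments, and the bounded `deque` keeps memory use constant regardless of video length, which is the property that matters for longer videos.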

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training