Visual Context Window Extension: A New Perspective for Long Video Understanding

Hongchen Wei; Zhenzhong Chen

arXiv:2409.20018 · cs.CV · October 3, 2024

TL;DR

This paper extends the visual context window of large multimodal models (LMMs) so that they can handle long videos without retraining. A progressive pooling strategy reduces memory consumption and improves performance.
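This page does not spell out how progressive pooling works, so the following is only a rough sketch of the general idea that pooling frame embeddings more aggressively lowers the visual token count (and with it memory). Everything in it is an assumption made for illustration: the 24x24 token grid, the strides, the grouping of frames, and the function names `tokens_per_frame` and `progressive_token_count` are hypothetical and are not taken from the paper.

```python
def tokens_per_frame(grid_side: int, stride: int) -> int:
    """Tokens left after non-overlapping stride x stride pooling of a square grid."""
    pooled_side = grid_side // stride
    return pooled_side * pooled_side


def progressive_token_count(
    num_frames: int,
    grid_side: int,
    group_size: int,
    fine_stride: int,
    coarse_stride: int,
) -> int:
    """Total visual tokens when the first frame of each group is pooled with
    fine_stride and the remaining frames with coarse_stride.

    The grouping rule is a hypothetical choice for this sketch.
    """
    total = 0
    for i in range(num_frames):
        stride = fine_stride if i % group_size == 0 else coarse_stride
        total += tokens_per_frame(grid_side, stride)
    return total


# Hypothetical setting: 256 frames, 24x24 token grid per frame.
uniform = progressive_token_count(256, 24, group_size=1, fine_stride=2, coarse_stride=2)
progressive = progressive_token_count(256, 24, group_size=4, fine_stride=2, coarse_stride=4)
print(uniform, progressive)  # 36864 16128
```

With these made-up settings the mixed-stride schedule keeps fewer than half of the tokens of uniform pooling. The figures are arithmetic on assumed inputs and should not be read as reproducing the memory savings reported for the paper.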

Contribution

The authors introduce a novel approach to adapt LMMs for long videos by extending visual context windows without retraining, addressing modality discrepancies and memory challenges.

Findings

01. Outperforms GPT-4o on the MLVU benchmark with only 7B parameters.

02. Reduces memory usage by approximately 45% in the 256-frame setting.

03. Consistently improves performance as the number of video frames increases.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training. However, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between visual and language modalities lead to different context windows for visual and language tokens, making it…

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Image and Video Quality Assessment