Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models

Yuxiao Chen; Jue Wang; Zhikang Zhang; Jingru Yi; Xu Zhang; Yang Zou; Zhaowei Cai; Jianbo Yuan; Xinyu Li; Hao Yang; Davide Modolo

arXiv:2602.17869 · cs.CV · February 23, 2026

Open Access

TL;DR

This paper presents an end-to-end framework that combines adaptive frame sampling with spatiotemporal compression, so that large multimodal models can process long-form videos despite the heavy redundancy in their frame sequences.

Contribution

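The paper introduces an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC), both integrated with a multimodal large language model (MLLM), so that lengthy videos can be processed efficiently while retaining key information.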
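To make the idea of an information-density-based sampler more concrete, here is a minimal sketch in Python. It is not the paper's AVS: the density measure (mean absolute difference between consecutive frames), the function names `frame_change_scores` and `adaptive_sample`, and the `floor` parameter are all assumptions chosen for illustration. The sketch only shows the general principle of spending a fixed frame budget where the video changes most.

```python
from bisect import bisect_left
from itertools import accumulate


def frame_change_scores(frames):
    """Mean absolute difference between each frame and its predecessor.

    `frames` is a list of equal-length lists of floats (flattened pixels or
    features). The first frame reuses the score of the second. This is an
    assumed stand-in for "information density", not the paper's measure.
    """
    diffs = [
        sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return [diffs[0]] + diffs if diffs else [1.0] * len(frames)


def adaptive_sample(frames, budget, floor=1e-6):
    """Pick up to `budget` frame indices, denser where scores are higher."""
    if budget <= 0 or not frames:
        return []
    # A small floor keeps fully static segments from having zero mass.
    scores = [s + floor for s in frame_change_scores(frames)]
    cdf = list(accumulate(scores))
    total = cdf[-1]
    # Place one target at the centre of each equal-mass bin of the
    # cumulative score, then map every target back to a frame index.
    targets = [(i + 0.5) * total / budget for i in range(budget)]
    last = len(frames) - 1
    return sorted({min(bisect_left(cdf, t), last) for t in targets})
```

For a clip whose first half is static and whose second half changes steadily, a call such as `adaptive_sample(frames, 4)` places all four picks in the second half, whereas uniform sampling would spend half of the budget on near-identical frames.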

Findings

1. Achieves high compression rates while preserving key information.

2. Demonstrates superior performance on long-form video benchmarks.

3. Effectively manages large, redundant video sequences.

Abstract

With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis