Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei; Jun Chen; Zechun Liu; Yunyang Xiong; Chong Zhou; Wei Wen; Junlin Han; Mingchen Zhuge; Saksham Suri; Qi Qian; Shuming Liu; Lemeng Wu; Raghuraman Krishnamoorthi; Vikas Chandra; Mohamed Elhoseiny; and Chenchen Zhu

arXiv:2604.08120 · cs.CV · April 10, 2026

TL;DR

Tempo is a query-aware framework that uses a small vision-language model and adaptive token allocation to compress long videos, so that a downstream model can understand them within a strict token budget.

Contribution

The paper introduces Tempo, a query-aware video compression method with Adaptive Token Allocation (ATA), aimed at improving long-video understanding while limiting information loss.

Findings

01. Achieves state-of-the-art performance under aggressive compression (0.5-16 tokens per frame).

02. Outperforms GPT-4o and Gemini 1.5 Pro on LVBench (4101 s videos).

03. Effectively compresses hour-long videos below theoretical limits.
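
To give a sense of scale for the figures above, the following back-of-the-envelope sketch converts a per-frame token rate into a total visual token count for a video of the quoted LVBench length. The 1 fps sampling rate and the helper name `visual_token_count` are assumptions made for illustration; this summary does not state the frame rate Tempo uses.

```python
def visual_token_count(duration_s: float, fps: float, tokens_per_frame: float) -> int:
    """Total visual tokens for a video sampled at `fps` with a fixed per-frame token rate."""
    frames = int(duration_s * fps)
    return int(frames * tokens_per_frame)


DURATION_S = 4101   # video length quoted for LVBench in the findings
ASSUMED_FPS = 1.0   # assumption for illustration, not taken from the source

# The two ends of the 0.5-16 tokens-per-frame range quoted in the findings.
low = visual_token_count(DURATION_S, ASSUMED_FPS, 0.5)
high = visual_token_count(DURATION_S, ASSUMED_FPS, 16)

print(f"0.5 tokens/frame -> {low} visual tokens")
print(f"16 tokens/frame  -> {high} visual tokens")
```

Under these assumptions the two ends of the range differ by a factor of 32, which is the room an adaptive allocator has to shift tokens between relevant and irrelevant parts of the video.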

Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a…
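
The abstract is truncated before it describes how ATA works, so the following is only a minimal sketch of the general idea of relevance-driven allocation under a hard budget, not the authors' algorithm. It assumes each video segment already has a non-negative relevance score for the query (for example, from a small model), and that each segment must receive between `min_tokens` and `max_tokens` tokens. The function names, parameters, and the proportional-then-greedy scheme are all hypothetical.

```python
from typing import List, Sequence


def allocate_tokens(relevance: Sequence[float], budget: int,
                    min_tokens: int = 1, max_tokens: int = 16) -> List[int]:
    """Split `budget` tokens across segments in proportion to relevance.

    Hypothetical scheme: every segment gets `min_tokens`, the rest is shared
    proportionally (rounded down, capped at `max_tokens`), and any leftover
    is handed to the most relevant segments that still have room.
    """
    n = len(relevance)
    if n * min_tokens > budget:
        raise ValueError("budget too small for the per-segment minimum")

    alloc = [min_tokens] * n
    remaining = budget - n * min_tokens
    total = sum(relevance)

    if total > 0:
        for i, score in enumerate(relevance):
            extra = int(remaining * score / total)  # floor keeps the sum within budget
            alloc[i] = min(max_tokens, min_tokens + extra)

    # Hand out tokens lost to rounding or capping, most relevant segment first.
    leftover = budget - sum(alloc)
    for i in sorted(range(n), key=lambda j: relevance[j], reverse=True):
        if leftover <= 0:
            break
        give = min(leftover, max_tokens - alloc[i])
        alloc[i] += give
        leftover -= give
    return alloc


def keep_head(tokens: Sequence, k: int) -> list:
    """Keep only the first k tokens of a segment (assumes early tokens matter most)."""
    return list(tokens[:k])


scores = [1.0, 7.0, 2.0]          # made-up relevance scores for three segments
plan = allocate_tokens(scores, budget=20)
print(plan)
```

If early tokens in each compressed segment carry most of the meaning, which is one way to read "semantic front-loading", then `keep_head` shows how a per-segment allocation could be enforced by simple truncation. Whether Tempo does this is not confirmed by the text available here.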

Code & Models

Repositories

feielysia/Tempo (GitHub)
