Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Yigui Feng (1), Qinglin Wang (1), Yang Liu (2), Jie Liu (1) ((1) The College of Computer Science, National University of Defense Technology, Changsha, Hunan, China, (2) The Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, Guangdong, China)

arXiv:2605.16366 · cs.CV · May 19, 2026

TL;DR

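Fre-Res is a dual-track video token compression method that separates spatial from temporal evidence: sparse high-fidelity spatial anchors preserve fine detail, while compact residual-frequency tokens encode dense temporal change. The result is a shorter token sequence that stays close to full-token accuracy.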

Contribution

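The paper proposes a frequency-residual compression framework whose Spatial-Guided Absorber injects temporal residual information into spatially corresponding anchor tokens, targeting more efficient video MLLMs across fine-grained short-video and long-video reasoning tasks.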

Findings

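01. Achieves near full-token performance with a shorter token sequence.

02. Captures temporal dynamics with compact residual-frequency tokens, exploiting the strong low-frequency concentration of inter-frame residual trajectories.

03. Maintains fine-grained reasoning through sparse high-fidelity spatial anchors.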

Abstract

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video…
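To make the residual-frequency idea concrete, here is a minimal, dependency-free sketch of a temporal 1D-DCT applied to the residual trajectory of a single latent channel. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the residual, so it is assumed here to be the frame-to-frame difference, and the function names, the orthonormal DCT-II variant, the use of the first frame as the anchor, and the `keep` budget are all hypothetical. The Spatial-Guided Absorber is not modeled.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1D sequence (direct O(n^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ii(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def compress_trajectory(latents, keep):
    """Compress one latent channel observed over T frames.

    Assumption: the first frame serves as the anchor and the residuals are
    frame-to-frame differences. Only the `keep` lowest-frequency DCT
    coefficients of the residual trajectory are retained.
    """
    anchor = latents[0]
    residuals = [latents[t] - latents[t - 1] for t in range(1, len(latents))]
    coeffs = dct_ii(residuals)
    return anchor, coeffs[:keep], len(residuals)


def reconstruct_trajectory(anchor, low_coeffs, n_residuals):
    """Zero-pad the dropped coefficients, invert the DCT, re-accumulate."""
    coeffs = list(low_coeffs) + [0.0] * (n_residuals - len(low_coeffs))
    residuals = idct_ii(coeffs)
    out = [anchor]
    for r in residuals:
        out.append(out[-1] + r)
    return out


def low_frequency_energy_fraction(latents, keep):
    """Share of residual energy held by the `keep` lowest frequencies."""
    residuals = [latents[t] - latents[t - 1] for t in range(1, len(latents))]
    coeffs = dct_ii(residuals)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total if total > 0 else 1.0
```

For a slowly varying channel such as `[math.sin(0.2 * t) for t in range(17)]`, most of the residual energy falls in the first few coefficients, which is the kind of low-frequency concentration the abstract reports for real vision latents. Keeping all coefficients reproduces the trajectory up to floating-point error, so `keep` acts as the budget knob in this toy setting.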
