TL;DR
TTF is a training-free, plug-and-play method that reduces visual tokens in video-language models by exploiting temporal redundancy, significantly improving efficiency with minimal accuracy loss.
Contribution
We introduce TTF, a novel token compression framework that automatically fuses tokens based on local similarity, enabling efficient video understanding without retraining.
Findings
TTF removes about 67% of visual tokens while retaining 99.5% of accuracy.
TTF introduces only approximately 0.16 GFLOPs of overhead.
TTF is effective on Qwen3-VL-8B with a similarity threshold of 0.70.
Abstract
Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame performs a local window similarity search (e.g.,), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM…
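The anchor-and-fuse step described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the first frame is taken as the anchor (the abstract does not say how the anchor is chosen), similarity is cosine similarity, the default window radius of 1 is a placeholder, fused tokens are folded into the anchor by a running average, and kept tokens simply retain their original (frame, row, column) coordinates rather than reproducing the paper's coordinate realignment. The names `cosine` and `temporal_token_fusion` are hypothetical.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def temporal_token_fusion(frames, window=1, threshold=0.70):
    """Illustrative sketch, not the paper's code.

    frames[t][i][j] is a token vector (list of floats) on an H x W grid.
    Returns a list of ((t, i, j), vector) pairs for the tokens that remain.
    """
    height, width = len(frames[0]), len(frames[0][0])
    # Assumption: the first frame serves as the anchor.
    anchor = [[list(tok) for tok in row] for row in frames[0]]
    merged = [[1] * width for _ in range(height)]  # tokens averaged into each slot
    kept = []

    for t in range(1, len(frames)):
        for i in range(height):
            for j in range(width):
                token = frames[t][i][j]
                best_sim, best_pos = -2.0, None
                # Search only the anchor tokens within `window` of (i, j).
                for r in range(max(0, i - window), min(height, i + window + 1)):
                    for c in range(max(0, j - window), min(width, j + window + 1)):
                        sim = cosine(anchor[r][c], token)
                        if sim > best_sim:
                            best_sim, best_pos = sim, (r, c)
                if best_sim > threshold:
                    # Fuse: fold the token into its best anchor match
                    # (running average is an assumed fusion rule).
                    r, c = best_pos
                    n = merged[r][c]
                    anchor[r][c] = [(a * n + b) / (n + 1)
                                    for a, b in zip(anchor[r][c], token)]
                    merged[r][c] = n + 1
                else:
                    # Not redundant enough: keep the token with its coordinates.
                    kept.append(((t, i, j), list(token)))

    anchor_tokens = [((0, i, j), anchor[i][j])
                     for i in range(height) for j in range(width)]
    return anchor_tokens + kept
```

As a usage example, three 2x2 frames in which the second frame repeats the first and the third is orthogonal to it reduce from 12 tokens to 8: the repeated frame is fused entirely into the anchor, and the third frame is kept in full.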
