TTF: Temporal Token Fusion for Efficient Video-Language Model

Simin Huo; Ning LI

arXiv:2605.07355·cs.CV·May 11, 2026

TL;DR

TTF is a training-free, plug-and-play method that reduces visual tokens in video-language models by exploiting temporal redundancy, significantly improving efficiency with minimal accuracy loss.

Contribution

We introduce TTF, a novel token compression framework that automatically fuses tokens based on local similarity, enabling efficient video understanding without retraining.

Findings

01. TTF removes about 67% of visual tokens while retaining 99.5% of accuracy.

02. TTF introduces only approximately 0.16 GFLOPs of overhead.

03. TTF is effective on Qwen3-VL-8B with a threshold of 0.70.

Abstract

Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448 \times 448$ resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then, for each subsequent frame, performs a local window similarity search (e.g., $3 \times 3$), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM…
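
The local window search described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the first frame serves as the anchor (the abstract only says the anchor is selected automatically), uses cosine similarity as the similarity measure, and treats "fusing" as simply dropping the redundant token. The function names `local_fusion_mask` and `compress_video` are hypothetical, and the default threshold of 0.70 is taken from the Qwen3-VL-8B finding listed above.

```python
import numpy as np


def local_fusion_mask(anchor, frame, window=3, threshold=0.70):
    """Return a boolean (H, W) mask that is True where a frame token is kept.

    anchor, frame: arrays of shape (H, W, D) holding one token per grid cell.
    A frame token is treated as redundant (mask False) when its best cosine
    similarity to any anchor token inside the local window exceeds threshold.
    """
    H, W, _ = anchor.shape
    r = window // 2
    # L2-normalise so that a dot product equals cosine similarity.
    a = anchor / (np.linalg.norm(anchor, axis=-1, keepdims=True) + 1e-8)
    f = frame / (np.linalg.norm(frame, axis=-1, keepdims=True) + 1e-8)

    best = np.full((H, W), -np.inf)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Frame cell (y, x) is compared with anchor cell (y + dy, x + dx);
            # the slices keep both indices inside the grid.
            fy = slice(max(0, -dy), min(H, H - dy))
            fx = slice(max(0, -dx), min(W, W - dx))
            ay = slice(max(0, dy), min(H, H + dy))
            ax = slice(max(0, dx), min(W, W + dx))
            sim = np.sum(f[fy, fx] * a[ay, ax], axis=-1)
            best[fy, fx] = np.maximum(best[fy, fx], sim)
    return best <= threshold


def compress_video(tokens, window=3, threshold=0.70):
    """tokens: (T, H, W, D). Returns a (T, H, W) keep mask.

    Assumption: frame 0 is the anchor and is always kept in full.
    """
    T, H, W, _ = tokens.shape
    keep = np.ones((T, H, W), dtype=bool)
    for t in range(1, T):
        keep[t] = local_fusion_mask(tokens[0], tokens[t], window, threshold)
    return keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.standard_normal((4, 6, 8, 64))
    keep = compress_video(video)
    print("fraction of tokens removed:", 1.0 - keep.mean())
```

Because each token is compared with only `window * window` anchor tokens rather than every token in the anchor frame, the search cost grows linearly with the number of tokens per frame. The coordinate realignment mentioned in the abstract is not covered by this sketch.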

Code & Models

Repositories

Cominder/ttf
github