DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
Minyoung Park, Taehun Kong, Sangjun Ahn

TL;DR
DynaTok is a training-free, adaptive token compression framework for Video-LLMs that dynamically allocates tokens across space and time, significantly reducing computational costs while maintaining high accuracy.
Contribution
DynaTok introduces a temporally adaptive, positional-bias-aware token compression method that allocates token budgets across both temporal and spatial dimensions without retraining the Video-LLM.
Findings
Retains over 95% accuracy with 90% token reduction.
Outperforms recent training-free approaches on multiple benchmarks.
Effectively captures long-term temporal variation and semantic diversity.
Abstract
Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to…
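The abstract is cut off before it states the exact allocation rule, so the following is only an illustrative sketch of how an EMA memory could drive a per-frame token budget. It is not the authors' implementation. The function name `allocate_temporal_budget`, the cosine-based novelty score, the smoothing factor `alpha`, and the `min_tokens` floor are all assumptions introduced here: frames that closely resemble the running EMA of past frame features are treated as redundant and receive fewer tokens, and frames that differ from it receive more.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def allocate_temporal_budget(frame_feats, total_budget, alpha=0.9, min_tokens=1):
    """Split total_budget tokens across frames by novelty relative to an EMA memory.

    Hypothetical rule (an assumption, not taken from the paper):
    novelty_t = 1 - cos(feature_t, memory), and the budget is shared
    in proportion to novelty after every frame gets min_tokens.
    """
    n = len(frame_feats)
    spare = total_budget - min_tokens * n
    if spare < 0:
        raise ValueError("total_budget is too small for min_tokens per frame")

    memory = list(frame_feats[0])
    novelty = [1.0]  # the first frame has no history, so treat it as fully novel
    for feat in frame_feats[1:]:
        novelty.append(max(0.0, 1.0 - cosine(feat, memory)))
        # EMA update: the memory drifts slowly toward the current frame
        memory = [alpha * m + (1.0 - alpha) * f for m, f in zip(memory, feat)]

    total_novelty = sum(novelty)  # at least 1.0 because of the first frame
    shares = [spare * v / total_novelty for v in novelty]
    budgets = [min_tokens + int(s) for s in shares]

    # Hand out tokens lost to rounding, largest fractional remainder first
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Toy pooled features: three near-identical frames, then a scene change
    frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(allocate_temporal_budget(frames, total_budget=20))
```

In this toy run the two repeated frames fall back to the `min_tokens` floor, while the frames at and after the scene change receive most of the budget. A real system would use pooled visual-encoder features per frame and would still need a separate spatial selection step within each frame.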
