KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
Haifeng Huang, Yang Li

TL;DR
KiToke is a training-free, kernel-based token compression method for Video LLMs that reduces redundancy and maintains performance at low token budgets.
Contribution
KiToke introduces a global redundancy measure and interval-aware token merging, and outperforms prior methods based on local heuristics without additional training.
Findings
Outperforms existing training-free compression methods on multiple benchmarks.
Maintains high accuracy even at a 1% token retention ratio.
Efficiently captures global redundancy for better token utilization.
Abstract
Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and…
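The abstract names three ingredients: a global kernel-based redundancy measure, temporal interval construction, and interval-aware merging. The sketch below is one possible reading of those ideas, not the authors' algorithm. The Gaussian kernel, the greedy least-covered selection rule, the cosine-similarity threshold for interval cuts, and the mean-pooling merge are all assumptions made for illustration, and every function name and parameter is hypothetical.

```python
import numpy as np

def gaussian_kernel(x, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix over all tokens (assumed kernel choice)."""
    sq = np.sum(x * x, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def select_diverse(kernel, budget):
    """Greedily keep `budget` tokens, each the least similar to those already kept."""
    budget = min(budget, kernel.shape[0])
    first = int(np.argmax(kernel.sum(axis=1)))  # token with the highest global density
    kept = [first]
    coverage = kernel[first].copy()  # max similarity of each token to the kept set
    while len(kept) < budget:
        coverage[kept] = np.inf  # never pick a kept token again
        nxt = int(np.argmin(coverage))
        kept.append(nxt)
        coverage = np.maximum(coverage, kernel[nxt])
    return sorted(kept)

def build_intervals(frame_means, threshold=0.9):
    """Give each frame an interval id; cut where consecutive frames stop looking alike."""
    norms = np.linalg.norm(frame_means, axis=1, keepdims=True) + 1e-12
    unit = frame_means / norms
    sims = np.sum(unit[1:] * unit[:-1], axis=1)  # cosine similarity of neighbours
    return np.concatenate([[0], np.cumsum(sims < threshold)])

def merge_within_intervals(features, kernel, kept, token_interval):
    """Average each dropped token into its most similar kept token in the same interval."""
    kept = np.asarray(kept)
    sums = features[kept].astype(float)
    counts = np.ones(len(kept))
    kept_set = set(kept.tolist())
    for i in range(len(features)):
        if i in kept_set:
            continue
        same = np.flatnonzero(token_interval[kept] == token_interval[i])
        if same.size == 0:
            continue  # no kept token in this interval, so the token is dropped
        j = same[np.argmax(kernel[i, kept[same]])]
        sums[j] += features[i]
        counts[j] += 1
    return sums / counts[:, None]

# Toy input: 3 frames x 2 tokens, 2-D features; frames 0 and 1 are near-duplicates.
features = np.array([[1.0, 0.0], [1.0, 0.1],
                     [1.0, 0.0], [0.9, 0.1],
                     [0.0, 1.0], [0.1, 1.0]])
frame_ids = np.array([0, 0, 1, 1, 2, 2])

frame_means = np.stack([features[frame_ids == f].mean(axis=0) for f in range(3)])
intervals = build_intervals(frame_means)   # one interval id per frame
token_interval = intervals[frame_ids]      # one interval id per token

kernel = gaussian_kernel(features)
kept = select_diverse(kernel, budget=2)
merged = merge_within_intervals(features, kernel, kept, token_interval)
print(intervals, kept, merged)
```

In this toy input the two near-duplicate frames fall into one interval and the third frame into another, so a budget of two keeps one token per interval and each kept token absorbs the rest of its interval. Restricting merges to a single interval is what keeps content from distinct temporal segments from being blended together, which is the sketch's stand-in for the temporal coherence the abstract mentions.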
