Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

TL;DR
The paper introduces STTM, a training-free spatio-temporal token merging method that exploits local redundancies in video data to accelerate Video LLMs without significant accuracy loss.
Contribution
STTM is a novel, training-free token merging approach that improves video understanding efficiency by leveraging local redundancies and multi-granular spatial tokens.
Findings
Achieves 2x speed-up with 0.5% accuracy drop at 50% token budget.
Achieves 3x speed-up with 2% accuracy drop at 30% token budget.
Outperforms existing token reduction methods across six video QA benchmarks.
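To put the token budgets above in perspective, here is a back-of-the-envelope sketch of how cost shrinks with the fraction of tokens kept. It assumes an idealized cost model that is not taken from the paper: attention-score computation scales quadratically with token count, and everything else scales linearly. The function name `relative_cost` is hypothetical. The reported 2x and 3x figures are measured end-to-end speed-ups, so they need not match either term exactly.

```python
def relative_cost(budget: float) -> dict:
    """Relative cost of keeping a `budget` fraction of the tokens.

    Assumed, idealized cost model (not from the paper):
      - linear terms (e.g. per-token projections and MLPs) scale as O(n)
      - attention-score terms scale as O(n^2)
    """
    return {
        "linear_terms": budget,
        "attention_terms": budget ** 2,
    }


if __name__ == "__main__":
    # The two token budgets listed in the findings.
    for budget in (0.5, 0.3):
        print(budget, relative_cost(budget))
```

Under this model a 50% budget leaves a quarter of the attention-score work and half of the linear work, which gives a rough sense of why the end-to-end gain falls between those two ratios.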
Abstract
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is…
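The two-stage idea in the abstract (a coarse-to-fine quadtree search per frame, followed by directed merging across time) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-similarity criterion, the thresholds `tau_s` and `tau_t`, the later-into-earlier merge direction, and the square power-of-two token grid are all assumptions made for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def quadtree_tokens(frame, tau_s, y=0, x=0, size=None):
    """Coarse-to-fine spatial merging over a quadtree (illustrative only).

    `frame` is an (S, S, D) grid of token features, S a power of two
    (assumption). A square region becomes ONE averaged token if every token
    in it has cosine similarity >= tau_s to the region mean; otherwise the
    region is split into four quadrants and each is searched in turn.
    Returns a dict {(y, x, size): token}.
    """
    if size is None:
        size = frame.shape[0]
    region = frame[y:y + size, x:x + size].reshape(-1, frame.shape[-1])
    mean = region.mean(axis=0)
    if size == 1 or min(cosine(t, mean) for t in region) >= tau_s:
        return {(y, x, size): mean}
    half = size // 2
    out = {}
    for dy in (0, half):
        for dx in (0, half):
            out.update(quadtree_tokens(frame, tau_s, y + dy, x + dx, half))
    return out


def temporal_merge(per_frame, tau_t):
    """Directed temporal merging (illustrative only).

    A token in frame t is absorbed by the previous frame when that frame has
    a token for the same quadtree region with cosine similarity >= tau_t.
    In this simplified sketch "absorbed" just means the later token is
    dropped; the direction (later into earlier) is an assumption.
    """
    kept = [dict(per_frame[0])]
    for t in range(1, len(per_frame)):
        prev = per_frame[t - 1]
        kept.append({
            key: tok for key, tok in per_frame[t].items()
            if key not in prev or cosine(tok, prev[key]) < tau_t
        })
    return kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(3, 4, 4, 8))  # 3 frames, 4x4 tokens, 8-dim features
    spatial = [quadtree_tokens(f, tau_s=0.9) for f in video]
    merged = temporal_merge(spatial, tau_t=0.9)
    print([len(m) for m in merged])  # tokens kept per frame
```

Raising `tau_s` or `tau_t` makes merging stricter and keeps more tokens, so in a sketch like this the thresholds play the role of the knob that sets the token budget.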
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
