Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun; Sukjun Hwang; Su Ho Han; Taeoh Kim; Inwoong Lee; Dongyoon Wee; Joon-Young Lee; Seon Joo Kim; Minho Shim

arXiv:2507.07990·cs.CV·July 11, 2025

Open Access

TL;DR

The paper introduces STTM, a training-free spatio-temporal token merging method that exploits local redundancies in video data to accelerate Video LLMs without significant accuracy loss.

Contribution

STTM is a novel, training-free token merging approach that improves video understanding efficiency by leveraging local redundancies and multi-granular spatial tokens.

Findings

01. Achieves a 2× speed-up with a 0.5% accuracy drop under a 50% token budget.

02. Achieves a 3× speed-up with a 2% accuracy drop under a 30% token budget.

03. Outperforms existing token reduction methods across six video QA benchmarks.
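
As a rough sanity check on the speed-ups above, one can compare them with an idealized model in which only self-attention cost matters. The model and the symbols $n$ (token count), $d$ (hidden width), and $r$ (token budget as a fraction) are assumptions for illustration and are not taken from the paper.

```latex
C_{\text{attn}}(n) \propto n^{2} d
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(r n)}{C_{\text{attn}}(n)} = r^{2},
\qquad
r = 0.5 \Rightarrow r^{2} = 0.25,
\qquad
r = 0.3 \Rightarrow r^{2} = 0.09 .
```

Under this model alone, a 50% budget would cut attention cost by 4× and a 30% budget by roughly 11×. The reported end-to-end speed-ups (2× and 3×) are smaller, which would be consistent with part of the runtime scaling only linearly with token count or not depending on it at all. The summary here does not break the cost down, so this reading is a guess.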

Abstract

Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a $2\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a $3\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is…
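
The two stages named in the abstract (a coarse-to-fine quadtree search within each frame, then directed merging across time) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-similarity criterion, the single shared threshold, the square power-of-two token grid, and the rule of matching tokens only when they occupy the same quadtree cell in consecutive frames are all assumptions made for this example.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def quadtree_tokens(frame, threshold, y=0, x=0, size=None):
    """Hypothetical spatial stage: coarse-to-fine search over a quadtree.

    frame: (H, W, D) token grid with H == W == a power of two (assumption).
    A block becomes one averaged token when every token in it is similar
    to the block mean; otherwise it is split into four quadrants.
    Returns a list of (y, x, size, feature) tuples.
    """
    if size is None:
        size = frame.shape[0]
    block = frame[y:y + size, x:x + size].reshape(-1, frame.shape[-1])
    mean = block.mean(axis=0)
    if size == 1 or min(cosine(t, mean) for t in block) >= threshold:
        return [(y, x, size, mean)]
    half = size // 2
    tokens = []
    for dy in (0, half):
        for dx in (0, half):
            tokens += quadtree_tokens(frame, threshold, y + dy, x + dx, half)
    return tokens

def temporal_merge(frames_tokens, threshold):
    """Hypothetical temporal stage: directed merging, later into earlier.

    A token is dropped when the previous frame holds a token for the same
    quadtree cell with a similar feature; the earlier token stays as the
    representative, so merging only ever points backward in time.
    Returns a list of (t, y, x, size, feature) tuples that survive.
    """
    kept, prev = [], {}
    for t, tokens in enumerate(frames_tokens):
        cur = {}
        for (y, x, size, feat) in tokens:
            key = (y, x, size)
            anchor = prev.get(key)
            if anchor is not None and cosine(feat, anchor) >= threshold:
                cur[key] = anchor  # merged; the earlier token keeps representing this cell
            else:
                kept.append((t, y, x, size, feat))
                cur[key] = feat
        prev = cur
    return kept

def merge_video(frames, threshold=0.9):
    # Spatial stage per frame, then temporal stage across frames.
    return temporal_merge([quadtree_tokens(f, threshold) for f in frames], threshold)
```

Calling `merge_video` on a list of `(H, W, D)` arrays returns the surviving tokens with their frame index and quadtree cell. In this sketch the threshold is the only knob, so hitting a specific token budget would need an outer search over it. How the paper controls the budget is not stated in this summary.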

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning