FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging
Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, Zhuotao Tian

TL;DR
FlashVID is a training-free framework that accelerates video large language models by merging redundant spatiotemporal tokens, preserving most of the original performance at much lower computational cost.
Contribution
The paper introduces FlashVID, a novel training-free method combining attention-based token selection and tree-based spatiotemporal merging for efficient video understanding.
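The selection stage is named Attention and Diversity-based Token Selection (ADTS) in the abstract, and one reviewer describes it as [CLS] attention combined with diversity-based selection. The following is a minimal sketch of that general idea only, not the paper's algorithm: the function names, the greedy max-min loop, and the product used to combine attention with diversity are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def select_tokens(tokens, attention, budget):
    """Greedily pick up to `budget` token indices from one frame.

    tokens:    list of feature vectors (hypothetical per-frame visual tokens)
    attention: one importance score per token (e.g. [CLS] attention)
    Assumption: score = attention * distance to the nearest already-selected token.
    """
    if budget <= 0 or not tokens:
        return []
    # Seed with the most-attended token.
    selected = [max(range(len(tokens)), key=lambda i: attention[i])]
    while len(selected) < min(budget, len(tokens)):
        def gain(i):
            nearest = min(cosine_distance(tokens[i], tokens[j]) for j in selected)
            return attention[i] * nearest
        remaining = [i for i in range(len(tokens)) if i not in selected]
        selected.append(max(remaining, key=gain))
    return selected

# Token 1 is a near-duplicate of token 0, so the diverse token 2 is preferred
# even though its attention score is lower.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
attention = [0.5, 0.4, 0.1]
picked = select_tokens(tokens, attention, budget=2)
```

In this toy frame the second pick goes to the dissimilar token rather than the near-duplicate, which is the behavior a diversity term is meant to provide.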
Findings
Retains 99.1% of performance while using only 10% of tokens.
Achieves 10x input frame extension with minimal performance loss.
Demonstrates effectiveness across multiple VLLMs and benchmarks.
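One plausible way to read the first two findings together: if only 10% of visual tokens are kept, roughly ten times as many frames fit in the same visual token budget. The arithmetic can be sketched as below; the frame count and tokens-per-frame values are hypothetical and are not taken from the paper.

```python
def visual_token_count(num_frames, tokens_per_frame, retention_ratio):
    """Number of visual tokens passed on after compression."""
    return round(num_frames * tokens_per_frame * retention_ratio)

# Hypothetical numbers: 16 frames at 200 tokens each, no compression.
baseline = visual_token_count(num_frames=16, tokens_per_frame=200, retention_ratio=1.0)

# Ten times the frames, keeping 10% of the tokens.
extended = visual_token_count(num_frames=160, tokens_per_frame=200, retention_ratio=0.1)
```

Both configurations yield the same token count, so under these assumptions the language model sees the same input length while covering ten times as many frames.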
Abstract
Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks spatiotemporal relationships and leads to suboptimal spatiotemporal compression. Because video is dynamic, highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal…
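The abstract is cut off before it describes the tree-based merging step, but one of the reviews below characterizes the tree construction as greedy nearest-neighbor matching with thresholding. The sketch below illustrates that description only: each token links to its most similar token in the previous frame if the similarity clears a threshold, and each resulting tree is collapsed into one token. The function names, the use of cosine similarity, and merging by averaging are assumptions, not details confirmed by the source.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_forest(frames, threshold):
    """frames: list of frames, each a list of token vectors.

    Returns a dict mapping (frame, index) to its parent (frame - 1, index),
    or to None when the token starts a new tree.
    """
    parent = {}
    for t, tokens in enumerate(frames):
        for i, tok in enumerate(tokens):
            parent[(t, i)] = None
            if t == 0 or not frames[t - 1]:
                continue
            prev = frames[t - 1]
            sims = [cosine(tok, p) for p in prev]
            best_j = max(range(len(prev)), key=sims.__getitem__)
            # Link only when the nearest neighbor is similar enough.
            if sims[best_j] >= threshold:
                parent[(t, i)] = (t - 1, best_j)
    return parent

def merge_trees(frames, parent):
    """Average all tokens in each tree (assumed merge rule), keyed by root."""
    groups = {}
    for node in parent:
        root = node
        while parent[root] is not None:
            root = parent[root]
        groups.setdefault(root, []).append(frames[node[0]][node[1]])
    return {
        root: [sum(col) / len(vecs) for col in zip(*vecs)]
        for root, vecs in groups.items()
    }

# Two frames of two tokens each; only the first token of frame 1 finds a match.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [-1.0, 0.0]]]
parent = build_forest(frames, threshold=0.8)
merged = merge_trees(frames, parent)
```

Because matching is by feature similarity rather than by grid position, a token can link to a parent at a different spatial location, and several tokens can share one parent, which is how a tree rather than a chain can emerge.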
Peer Reviews
Decision: ICLR 2026 Oral
1. The idea of jointly modeling spatial and temporal redundancy via tree-based merging is well motivated and insightful. 2. The other component, ADTS, which targets intra-frame redundancy, is combined organically with TSTM in the pipeline, and the two are complementary. 3. Experiments cover multiple benchmarks and multiple backbones and compare against several previous state-of-the-art methods; the proposed method consistently outperforms them.
1. When comparing with previous work, the paper measures efficiency only by retention ratio and TFLOPs. However, token pruning itself takes time. Given that the proposed method has many steps (e.g., tree construction, similarity calculation, attention mask computation), the time complexity of pruning at inference and the potential memory cost should also be analyzed comprehensively and compared with previous work. 2. In the paper, the author(s) claim to use a hybrid compres
- The paper identifies a clear limitation of existing methods: temporal redundancy is typically defined by fixed spatial locations, which fails to capture video dynamics where objects move, scale, and rotate. The tree-based spatiotemporal merging is an elegant solution to this problem. - The experiments are comprehensive, covering three diverse VLLMs and five benchmarks. The results consistently show FlashVID outperforming baselines. - The method requires no additional training, making it
- Limited novelty in individual components: ADTS essentially combines existing techniques ([CLS] attention + diversity-based selection via MMDP). The calibrated MMDP formulation (Algorithm 4) is relatively straightforward. The tree construction in TSTM (Algorithm 1, lines 9-16) is a simple greedy nearest-neighbor matching with thresholding. The "tree" structure emerges naturally but isn't explicitly optimized. - The paper states that depth and breadth constraints "yielded negligible gains" (pag
1. FlashVID is a training-free framework that can be used as a plug-and-play module on existing trained VLLMs without expensive training costs. 2. It addresses a key pain point of previous VLLM acceleration methods: they typically compress spatial and temporal redundancies independently, or rely on a single spatial correspondence for temporal merging. TSTM can flexibly track and merge similar tokens that change dynamically over time in terms of spatial location, scale, or direction by building
1. This method has multiple hyperparameters that need empirical tuning, which may affect its plug-and-play performance across different models or datasets, for example $T_{\tau}$, $f_{e}$, and $\alpha$. 2. The paper mentions the interesting "less is more" phenomenon and attributes it to "excessive visual token input may introduce noise," which is a reasonable assumption but lacks more in-depth quantitative or qualitative analysis to clarify how this "noise" specifically affects VLLM's attent
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
