ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

TL;DR
This paper introduces ST$^3$, a training-free framework that accelerates multimodal large language models by trimming redundant visual tokens during inference, achieving roughly 2x faster inference with minimal accuracy loss.
Contribution
The paper presents a novel, training-free method for dynamic visual token trimming in MLLMs, improving inference speed while maintaining performance.
Findings
Approximately 2x faster inference speed.
Reduces KV cache memory by about 30%.
Maintains consistent accuracy across datasets.
Abstract
Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analyses of MLLM attention mechanisms remain shallow, leading to coarse-grained token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming (ST$^3$), a framework designed to accelerate MLLM inference without retraining. ST$^3$ consists of two primary components: 1) Progressive Visual Token Pruning (PVTP), which eliminates…
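The abstract is truncated before it states the pruning criterion, so the following is only a generic illustration of attention-guided visual token pruning, not the paper's PVTP. It assumes visual tokens are ranked by the mean attention they receive and that a fixed fraction is kept; the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the toy attention matrix are all hypothetical.

```python
import numpy as np


def prune_visual_tokens(attn, visual_idx, keep_ratio):
    """Keep the visual tokens that receive the most attention.

    attn:       (num_queries, num_tokens) attention weights (assumed input).
    visual_idx: positions of the visual tokens within the sequence.
    keep_ratio: fraction of visual tokens to retain (hypothetical knob).
    Returns the retained visual token positions in sequence order.
    """
    visual_idx = np.asarray(visual_idx)
    # Average attention each visual token receives over all query positions.
    scores = attn[:, visual_idx].mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * len(visual_idx))))
    # Indices of the highest-scoring visual tokens.
    top = np.argsort(-scores, kind="stable")[:n_keep]
    return np.sort(visual_idx[top])


# Toy example: 2 query positions, 5 tokens, tokens 1-3 are visual.
attn = np.array([
    [0.10, 0.40, 0.05, 0.30, 0.15],
    [0.20, 0.20, 0.10, 0.40, 0.10],
])
kept = prune_visual_tokens(attn, visual_idx=[1, 2, 3], keep_ratio=0.67)
print(kept.tolist())
```

In this toy case token 2 receives the least attention and is dropped. A progressive scheme would presumably apply such a step repeatedly across layers with its own schedule, which this sketch does not attempt to reproduce.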
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning
