ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

Jiedong Zhuang; Lu Lu; Ming Dai; Rui Hu; Jian Chen; Qiang Liu; Haoji Hu

arXiv:2412.20105·cs.CV·December 31, 2024

TL;DR

This paper introduces ST$^3$, a framework that accelerates multimodal large language models by trimming redundant visual tokens during inference, roughly doubling inference speed with minimal accuracy loss.

Contribution

The paper presents a novel, training-free method for dynamic visual token trimming in MLLMs, improving inference speed while maintaining performance.

Findings

01 Approximately 2x faster inference speed.

02 Reduces KV cache memory by about 30%.

03 Maintains consistent accuracy across datasets.

Abstract

Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis of MLLM attention mechanisms remains shallow, leading to coarse-grained token pruning strategies that fail to balance speed and accuracy effectively. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming (ST$^3$), a framework designed to accelerate MLLM inference without retraining. ST$^3$ consists of two primary components: 1) Progressive Visual Token Pruning (PVTP), which eliminates…
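The abstract is cut off before it describes how PVTP selects tokens, so the following is not the authors' algorithm. It is a minimal sketch of the general idea of attention-based visual token pruning, under stated assumptions: the function names `prune_visual_tokens` and `progressive_schedule`, the per-position attention scores, and the keep ratios are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attention_received: Sequence[float],
    visual_positions: Sequence[int],
    keep_ratio: float,
) -> List[int]:
    """Return the sorted sequence positions that survive one pruning step.

    attention_received[i] is an assumed score: the average attention that
    position i receives. Only positions in visual_positions can be dropped;
    all other (text) positions are always kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    visual = list(visual_positions)
    n_keep = max(1, round(len(visual) * keep_ratio)) if visual else 0
    # Rank visual tokens by the attention they receive, highest first.
    ranked = sorted(visual, key=lambda i: attention_received[i], reverse=True)
    kept_visual = set(ranked[:n_keep])
    visual_set = set(visual)
    return [
        i
        for i in range(len(attention_received))
        if i not in visual_set or i in kept_visual
    ]


def progressive_schedule(n_visual: int, keep_ratios: Sequence[float]) -> List[int]:
    """Visual tokens remaining after each successive pruning stage.

    The stage-wise keep ratios are an assumption, not values from the paper.
    """
    counts = []
    remaining = n_visual
    for ratio in keep_ratios:
        remaining = max(1, round(remaining * ratio)) if remaining else 0
        counts.append(remaining)
    return counts


# Positions 0 and 5 are text tokens; positions 1 to 4 are visual tokens.
scores = [0.9, 0.1, 0.4, 0.05, 0.3, 0.8]
kept = prune_visual_tokens(scores, visual_positions=[1, 2, 3, 4], keep_ratio=0.5)
stages = progressive_schedule(576, [0.5, 0.5, 0.5])
```

In this toy input, the two visual tokens with the lowest scores are dropped and both text tokens are kept. Applying such a step at several depths shrinks the visual token count stage by stage, which is one plausible reading of "progressive" pruning; the paper itself should be consulted for the actual criterion and schedule.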

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Pruning