Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

Dingkun Zhang; Shuhan Qi; Yulin Wu; Xinyu Xiao; Xuan Wang; Long Chen

arXiv:2602.03815·cs.CV·February 4, 2026

Open Access

TL;DR

The paper introduces DualSpeed, a dual-mode training framework for multimodal large language models that combines fast token pruning with full sequence training to improve efficiency without sacrificing performance.

Contribution

The paper proposes a dual-mode training approach for multimodal models that integrates visual token pruning with full-sequence training and self-distillation to make training more efficient.

Findings

01. Accelerates LLaVA-1.5 training by 2.1×
02. Speeds up LLaVA-NeXT training by 4.0×
03. Retains over 99% of the original performance

Abstract

Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency, which is associated with their massive model sizes and large numbers of visual tokens. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we explore another substantial research direction for efficient training: reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned, full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's…
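
To make the idea of reducing visual tokens during training concrete, here is a minimal sketch, not the authors' implementation. It assumes each visual token already carries an importance score from some pluggable VTP method, keeps only the top-scoring fraction in a "fast" pass, and keeps the full sequence in a "slow" pass. The names `prune_visual_tokens`, `build_training_sequence`, `keep_ratio`, and the score values are hypothetical and chosen only for illustration. The mode isolator and any part of the method beyond the truncated abstract are not modeled.

```python
def prune_visual_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring visual tokens, preserving their original order.

    `scores` stands in for whatever importance signal a VTP plugin provides
    (an assumption; the source does not specify one).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank indices by descending score, keep the best n_keep,
    # then re-sort the survivors so spatial order is unchanged.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def build_training_sequence(visual_tokens, scores, text_tokens, mode, keep_ratio=0.25):
    """Assemble one training sequence in a hypothetical 'fast' or 'slow' mode."""
    if mode == "fast":
        # Fast pass: train on a pruned visual sequence.
        visual = prune_visual_tokens(visual_tokens, scores, keep_ratio)
    elif mode == "slow":
        # Slow pass: train on the full visual sequence, as seen at inference.
        visual = list(visual_tokens)
    else:
        raise ValueError("mode must be 'fast' or 'slow'")
    return visual + list(text_tokens)


if __name__ == "__main__":
    visual = ["v0", "v1", "v2", "v3"]
    scores = [0.1, 0.9, 0.3, 0.8]
    text = ["t0", "t1"]
    print(build_training_sequence(visual, scores, text, "fast", keep_ratio=0.5))
    print(build_training_sequence(visual, scores, text, "slow"))
```

In this sketch, a model trained only on "fast" sequences never sees the full visual sequence it will receive at inference, which is the training-inference mismatch the abstract describes.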

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis