FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Guoyang Xia; Yifeng Ding; Fengfa Li; Lei Ren; Wei Chen; Fangxiang Feng; Xiaojie Wang

arXiv:2511.17885 · cs.CV · March 23, 2026

TL;DR

FastMMoE is a training-free framework that accelerates mixture-of-experts (MoE) based multimodal large language models by cutting redundant visual computation in two ways: activating fewer experts for visual tokens, and pruning visual tokens with a routing-aware criterion. It lowers FLOPs by up to 55% while retaining approximately 95.5% of the original performance.

Contribution

The paper introduces a training-free acceleration method for MoE-based MLLMs that combines expert activation reduction with routing-aware token pruning, both motivated by an analysis of routing behavior.

Findings

01 FLOPs reduced by up to 55% with minimal performance loss

02 Outperforms dense-model pruning baselines like FastV and SparseVLM

03 Maintains approximately 95.5% of original performance

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages…
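Strategy (i) in the abstract, expert activation reduction for visual tokens, can be illustrated with a small routing sketch. This is not the authors' implementation: the excerpt does not say how FastMMoE decides the number of active experts per visual token, so the sketch simply assumes a fixed, smaller top-k for visual tokens than for text tokens. The names `route_token`, `route_sequence`, `expert_calls`, `k_text`, and `k_visual` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Generic top-k MoE routing: keep the k most probable experts
    and renormalize their gate weights so they sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def route_sequence(
    router_logits: List[List[float]],
    is_visual: List[bool],
    k_text: int,
    k_visual: int,
) -> List[List[Tuple[int, float]]]:
    """Assumption for illustration: visual tokens use a smaller,
    fixed top-k (k_visual) than text tokens (k_text)."""
    return [
        route_token(logits, k_visual if visual else k_text)
        for logits, visual in zip(router_logits, is_visual)
    ]


def expert_calls(routes: List[List[Tuple[int, float]]]) -> int:
    """Total number of (token, expert) evaluations, a rough proxy for MoE FLOPs."""
    return sum(len(r) for r in routes)


if __name__ == "__main__":
    # Four tokens with identical toy router logits over four experts.
    logits = [[2.0, 1.0, 0.5, 0.1]] * 4
    is_visual = [False, True, True, False]

    baseline = route_sequence(logits, is_visual, k_text=2, k_visual=2)
    reduced = route_sequence(logits, is_visual, k_text=2, k_visual=1)
    print(expert_calls(baseline), expert_calls(reduced))
```

In this toy setting, the number of expert evaluations drops from 8 to 6 because each of the two visual tokens is sent to one expert instead of two. The count is only a proxy for the expert-layer cost and says nothing about the accuracy trade-off reported in the paper.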

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning