BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, Fuyu Lv, Ou Dan, Li Yuan

TL;DR
BEAM introduces trainable binary masks for dynamic expert selection in MoE models, significantly reducing computation and improving inference speed while preserving most of the original model's performance.
Contribution
The paper presents an end-to-end trainable method for token-adaptive expert routing using binary masks, together with an efficient CUDA implementation for practical MoE acceleration.
Findings
Retains over 98% of the original model's performance.
Reduces MoE layer FLOPs by up to 85%.
Achieves up to 2.5× faster decoding and 1.4× higher throughput.
Abstract
Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer a severe performance drop at high sparsity due to a train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM…
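The abstract names three ingredients: a binary mask applied on top of Top-K routing, a straight-through estimator, and an auxiliary regularization loss. It does not say how the mask is parameterized or what form the loss takes, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. The names `route_with_binary_mask`, `ste_backward`, and `sparsity_penalty` are hypothetical, as are the per-expert mask logits, the zero threshold, the rule that the top-1 expert is always kept, and the sigmoid-based penalty.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_with_binary_mask(router_logits, mask_logits, top_k):
    """Forward pass for one token: Top-K selection, then a hard 0/1 mask.

    Assumptions (not from the paper): the mask is 1 where the mask logit
    is positive, and the top-1 expert is always kept so that every token
    is processed by at least one expert.
    """
    probs = softmax(router_logits)
    # Standard Top-K: indices of the k largest routing probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Hard threshold in the forward pass gives a binary mask per expert.
    kept = [i for i in top if mask_logits[i] > 0 or i == top[0]]
    # Renormalize the gate weights over the experts that survive the mask.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def ste_backward(upstream_grad):
    """Straight-through estimator: the hard threshold has zero gradient
    almost everywhere, so the backward pass treats it as the identity and
    hands the upstream gradient to the mask logit unchanged."""
    return upstream_grad


def sparsity_penalty(selected_mask_logits):
    """Assumed auxiliary loss: mean sigmoid of the mask logits, a
    differentiable proxy for the fraction of selected experts kept."""
    sig = [1.0 / (1.0 + math.exp(-z)) for z in selected_mask_logits]
    return sum(sig) / len(sig)


# One token, four experts, Top-3 routing; all numbers are illustrative.
router_logits = [2.0, 1.0, 0.5, -1.0]
mask_logits = [1.5, -0.3, 0.8, 2.0]
weights = route_with_binary_mask(router_logits, mask_logits, top_k=3)
active_fraction = len(weights) / 3  # share of the Top-3 experts still run
```

In this toy example, experts 0, 1, and 2 are selected by Top-3, the mask drops expert 1, and the token is served by two experts instead of three. Expert 3 has a positive mask logit but never runs because it was not in the Top-K set. Expert FLOPs scale roughly with the number of experts run per token, so driving the average kept fraction down is what yields compute savings of the kind the paper reports.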
