Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

TL;DR
LYNX is a system that improves the efficiency of MoE model inference by dynamically remapping token-to-expert assignments, reducing expert activation and increasing throughput without significant accuracy loss.
Contribution
LYNX introduces a workload-agnostic method using AffinityBinning to optimize expert activation in MoE inference, addressing batching inefficiencies.
Findings
Up to 1.30x throughput improvement across models and benchmarks.
Maintains less than 1% accuracy loss.
Enhances existing techniques by up to 1.38x.
Abstract
Selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. LYNX leverages a key property of MoE training: load-balancing losses introduce batch-level expert activation skews and redundancy, which it exploits by remapping low-affinity token-to-expert assignments within each batch using a novel AffinityBinning technique that reduces the total experts invoked. Our…
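To make the batch-level remapping idea concrete, here is a minimal sketch in plain Python. It is not the paper's AffinityBinning algorithm, which the abstract does not specify. It assumes a simple stand-in heuristic: keep the experts that the most tokens in the batch selected, then move every assignment that targets a dropped expert to the token's next-best kept expert. The names `top_k_experts`, `remap_batch`, and `budget` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    return sorted(range(len(scores)), key=lambda e: (-scores[e], e))[:k]


def remap_batch(router_scores, k, budget):
    """Restrict a batch to at most `budget` active experts.

    ASSUMPTION: this demand-based heuristic is illustrative only and is
    not taken from the LYNX paper.

    router_scores: one list of per-expert router scores per token.
    Returns (assignments, kept_experts).
    """
    if budget < k:
        raise ValueError("budget must be at least k")

    # Standard top-k routing, computed independently for each token.
    original = [top_k_experts(scores, k) for scores in router_scores]

    # Batch-level demand: how many tokens selected each expert.
    demand = Counter(e for choice in original for e in choice)

    # Keep the most-demanded experts; ties go to the lower expert index.
    kept = sorted(demand, key=lambda e: (-demand[e], e))[:budget]

    # Each token takes its k best experts among the kept ones. Choices
    # that were already kept survive; the rest are remapped.
    assignments = [
        sorted(kept, key=lambda e: (-scores[e], e))[:k]
        for scores in router_scores
    ]
    return assignments, sorted(kept)


# Four tokens, four experts, top-2 routing.
router_scores = [
    [0.60, 0.30, 0.05, 0.05],  # originally experts 0 and 1
    [0.50, 0.10, 0.35, 0.05],  # originally experts 0 and 2
    [0.40, 0.45, 0.10, 0.05],  # originally experts 1 and 0
    [0.10, 0.50, 0.10, 0.30],  # originally experts 1 and 3
]
assignments, kept = remap_batch(router_scores, k=2, budget=2)
print(kept)         # [0, 1]
print(assignments)  # [[0, 1], [0, 1], [1, 0], [1, 0]]
```

In this toy batch, plain top-2 routing touches all four experts, while the remapped batch touches two. Only the second choices of tokens 1 and 3 change, which are the lower-scoring assignments. A real system would also have to decide how to reweight the remapped experts' outputs, which this sketch leaves out.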
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems · Time Series Analysis and Forecasting
Methods: Mixture of Experts
