Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Vima Gupta; Jae Hyung Ju; Kartik Sinha; Ada Gavrilovska; Anand Padmanabha Iyer

arXiv:2411.08982·cs.LG·May 20, 2026

TL;DR

LYNX is a system that improves the efficiency of MoE model inference by dynamically remapping token-to-expert assignments, reducing expert activation and increasing throughput without significant accuracy loss.

Contribution

LYNX introduces AffinityBinning, a workload-agnostic technique that remaps low-affinity token-to-expert assignments within each batch to reduce the number of experts invoked, addressing the inefficiency that batching causes in MoE inference.
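
To get a feel for why batching erodes the sparsity benefit of MoE models, the short calculation below estimates how many distinct experts a batch touches. It is an illustration only, not a result from the paper: it assumes every token independently picks `top_k` distinct experts uniformly at random, and the function name and the 8-expert, top-2 configuration are chosen here for the example.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts activated by one batch.

    Assumption (illustrative, not from the paper): each token independently
    selects top_k distinct experts uniformly at random.
    """
    # Probability that a single token does NOT route to a given expert.
    p_miss_one_token = 1.0 - top_k / num_experts
    # Probability that at least one token in the batch hits that expert,
    # summed over all experts by linearity of expectation.
    return num_experts * (1.0 - p_miss_one_token ** batch_tokens)


if __name__ == "__main__":
    for batch_tokens in (1, 4, 16, 64):
        active = expected_active_experts(num_experts=8, top_k=2, batch_tokens=batch_tokens)
        print(f"batch={batch_tokens:3d}  expected active experts={active:.2f}")
```

Under this simplified model the expected count climbs toward the full expert count as the batch grows, which is the tension the listing refers to. Real routers are not uniform, so the numbers are only indicative.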

Findings

01. Up to 1.30x throughput improvement across models and benchmarks.

02. Keeps accuracy loss below 1%.

03. Enhances existing techniques by up to 1.38x.

Abstract

The selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundation models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. LYNX leverages a key property of MoE training: load-balancing losses introduce batch-level expert activation skews and redundancy. LYNX exploits this property by remapping low-affinity token-to-expert assignments within each batch, using a novel AffinityBinning technique that reduces the total number of experts invoked. Our…
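
The abstract does not spell out how AffinityBinning works, so the following is only a rough sketch of the general idea of batch-aware remapping, not the authors' algorithm. It assumes a top-k router, treats each token's non-top-1 choices as its low-affinity assignments, and drops the experts that the batch requests least often. The names `top_k_experts`, `remap_batch`, `active_experts`, and the parameter `num_to_drop` are all hypothetical.

```python
from collections import Counter


def top_k_experts(scores, k):
    """Expert indices ordered by descending router score, truncated to k."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def active_experts(assignments):
    """Set of experts that at least one token in the batch is routed to."""
    return {expert for chosen in assignments for expert in chosen}


def remap_batch(router_scores, k, num_to_drop):
    """Hypothetical batch-aware remapping (NOT the paper's AffinityBinning).

    router_scores: one list of per-expert scores for each token in the batch.
    Returns (original_assignments, remapped_assignments).
    """
    num_experts = len(router_scores[0])
    assignments = [top_k_experts(scores, k) for scores in router_scores]

    # Count how often each expert is requested across the whole batch.
    demand = Counter(expert for chosen in assignments for expert in chosen)

    # Mark the least-requested experts for removal (ties broken by index).
    by_demand = sorted(demand, key=lambda e: (demand[e], e))
    dropped = set(by_demand[:num_to_drop])

    remapped = []
    for scores, chosen in zip(router_scores, assignments):
        ranked = top_k_experts(scores, num_experts)  # full preference order
        new_choice = [chosen[0]]  # the top-1 assignment is never changed
        for expert in chosen[1:]:
            if expert in dropped:
                # Fall back to the token's best-scoring expert that is kept
                # and not already among this token's choices.
                fallback = next(
                    (c for c in ranked if c not in dropped and c not in chosen
                     and c not in new_choice),
                    None,
                )
                if fallback is not None:
                    expert = fallback
            new_choice.append(expert)
        remapped.append(new_choice)
    return assignments, remapped


if __name__ == "__main__":
    # Made-up router scores for a batch of 3 tokens over 4 experts.
    scores = [
        [0.50, 0.30, 0.15, 0.05],
        [0.45, 0.10, 0.05, 0.40],
        [0.20, 0.60, 0.15, 0.05],
    ]
    before, after = remap_batch(scores, k=2, num_to_drop=1)
    print("before:", before, "active:", sorted(active_experts(before)))
    print("after: ", after, "active:", sorted(active_experts(after)))
```

In this toy batch only the second token's lower-ranked choice is moved, so one fewer expert has to be loaded. Leaving top-1 assignments untouched is a design choice of this sketch, meant to reflect the abstract's emphasis on remapping only low-affinity assignments.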

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems · Time Series Analysis and Forecasting

Methods: Mixture of Experts