The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
Vignesh Adhinarayanan, Nuwan Jayasena

TL;DR
This paper reveals a structural double penalty in Mixture-of-Experts models during inference, formalizes it with the $qs$ inequality, and demonstrates its impact on model throughput and feasibility across various architectures.
Contribution
The paper introduces the $qs$ inequality to predict when MoE models are disadvantaged compared to dense models, highlighting a fundamental architectural limitation.
Findings
MoE models suffer from reuse fragmentation during inference.
The $qs$ inequality unifies sparsity and quality factors to predict MoE disadvantages.
Massive MoE architectures can become infeasible on large clusters, unlike dense models.
Abstract
Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE…
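The two penalties named in the abstract can be illustrated with back-of-the-envelope arithmetic. The Python sketch below is not the paper's model and does not reproduce the $qs$ inequality, whose exact form is not given in this excerpt. All function names are hypothetical, routing is assumed to be perfectly uniform, and each FFN weight is assumed to be read from HBM once per decoding step.

```python
def tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Average tokens reaching each expert, assuming uniform top-k routing."""
    return batch_tokens * top_k / num_experts


def ffn_arithmetic_intensity(tokens: float, bytes_per_param: float = 2.0) -> float:
    """FLOPs per weight byte when each weight is read once and reused for
    `tokens` tokens (2 FLOPs per multiply-add; fp16 weights by default)."""
    return 2.0 * tokens / bytes_per_param


def kv_headroom_bytes(hbm_bytes: float, resident_params: float,
                      bytes_per_param: float = 2.0) -> float:
    """HBM left for the KV cache after all resident weights are placed.
    Ignores activations and other buffers (simplifying assumption)."""
    return hbm_bytes - resident_params * bytes_per_param


# Illustrative numbers only, not taken from the paper.
batch = 64
dense_intensity = ffn_arithmetic_intensity(batch)
moe_intensity = ffn_arithmetic_intensity(
    tokens_per_expert(batch, num_experts=64, top_k=2)
)
print(f"dense FFN: {dense_intensity:.0f} FLOPs/byte")
print(f"MoE FFN:   {moe_intensity:.0f} FLOPs/byte")
```

With these illustrative values, each expert sees only 2 of the 64 tokens in the batch, so every weight byte loaded is reused far less often than in the dense FFN. That is the sense in which routing pushes the FFN toward the bandwidth-bound regime. `kv_headroom_bytes` captures the second penalty: every expert must stay resident in HBM even though only a few are active per token, which shrinks the space available for the KV cache.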
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
