The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference

Vignesh Adhinarayanan; Nuwan Jayasena

arXiv:2603.08960 · cs.LG · March 11, 2026

Open Access

TL;DR

This paper reveals a structural double penalty in Mixture-of-Experts models during inference, formalizes it with the $qs$ inequality, and demonstrates its impact on model throughput and feasibility across various architectures.

Contribution

The paper introduces the $qs$ inequality, a predictive criterion for when an MoE model is structurally disadvantaged at inference relative to a quality-matched dense model, highlighting a fundamental architectural limitation.

Findings

01. MoE models suffer from reuse fragmentation during inference: expert routing fragments microbatches and reduces weight reuse.

02. The $qs$ inequality combines sparsity ($s$) and the quality-equivalence factor ($q$) to predict when MoE is disadvantaged.

03. Massive MoE architectures can become infeasible on large clusters, unlike dense models.

Abstract

Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE…
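
The first penalty, reduced weight reuse, can be illustrated with a back-of-the-envelope sketch. This is not the paper's formulation and does not reproduce the $qs$ inequality itself. It assumes uniform top-k routing, and the function names, the decode batch size, and the expert configurations are all hypothetical. Under those assumptions, each expert's weights are reused by only a fraction `top_k / num_experts` of the tokens that a dense FFN's weights would serve in the same decode step.

```python
# Illustrative sketch only: assumes perfectly uniform top-k routing.
# All names and numbers here are hypothetical, not taken from the paper.

def expected_tokens_per_expert(batch_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected number of tokens each expert processes in one decode step."""
    return batch_tokens * top_k / num_experts


def reuse_ratio(num_experts: int, top_k: int) -> float:
    """Weight reuse of one expert relative to a dense FFN at the same batch size.

    A dense FFN reuses each loaded weight across every token in the batch;
    an expert reuses its weights only across the tokens routed to it.
    """
    return top_k / num_experts


batch_tokens = 64  # hypothetical decode batch (one token per sequence)
for num_experts, top_k in [(8, 2), (64, 4), (256, 8)]:
    per_expert = expected_tokens_per_expert(batch_tokens, num_experts, top_k)
    ratio = reuse_ratio(num_experts, top_k)
    print(f"E={num_experts:>3} k={top_k}: {per_expert:5.1f} tokens/expert "
          f"(dense: {batch_tokens}), reuse ratio {ratio:.4f}")
```

The ratio `top_k / num_experts` is only a rough stand-in for the expert-FFN part of the sparsity $s$, which the abstract defines over all parameters activated per token. As the ratio shrinks, fewer tokens amortize each expert weight load, which is consistent with the bandwidth-bound regime the abstract describes.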

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications