LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts
Venmugil Elango, Nidhi Bhatia, Roger Waleffe, Rasoul Shafipour, Tomer Asida, Abhinav Khattar, Nave Assaf, Maximilian Golub, Joey Guman, Tiyasa Mitra, Ritchie Zhao, Ritika Borkar, Ran Zilberstein, Mostofa Patwary, Mohammad Shoeybi, Bita Rouhani

TL;DR
This paper introduces LatentMoE, a new mixture-of-experts architecture designed to maximize accuracy relative to compute cost. Empirical and theoretical analysis indicates that it outperforms standard MoE designs.
Contribution
The paper presents LatentMoE, a novel MoE architecture designed via hardware-software co-design to optimize accuracy per FLOP and parameter, with extensive empirical validation.
Findings
LatentMoE outperforms standard MoE architectures in accuracy per FLOP and parameter.
Empirical exploration at scales up to 95B parameters demonstrates superior efficiency.
Theoretical analysis supports the empirical results.
Abstract
Mixture of Experts (MoEs) have become a central component of many state-of-the-art open-source and proprietary large language models. Despite their widespread adoption, it remains unclear how close existing MoE architectures are to optimal with respect to inference cost, as measured by accuracy per floating-point operation and per parameter. In this work, we revisit MoE design from a hardware-software co-design perspective, grounded in empirical and theoretical considerations. We characterize key performance bottlenecks across diverse deployment regimes, spanning offline high-throughput execution and online, latency-critical inference. Guided by these insights, we introduce LatentMoE, a new model architecture resulting from systematic design exploration and optimized for maximal accuracy per unit of compute. Empirical design space exploration at scales of up to 95B parameters and over a…
Taxonomy
Topics: Topic Modeling · Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning
