MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Yinsicheng Jiang; Yao Fu; Yeqi Huang; Ping Nie; Zhan Lu; Leyang Xue; Congjie He; Man-Kit Sit; Jilong Xue; Li Dong; Ziming Miao; Dayou Du; Tairan Xu; Kai Zou; Edoardo Ponti; Luo Mai

arXiv:2505.11415·cs.LG·May 22, 2025

Open Access

TL;DR

MoE-CAP introduces a specialized benchmark for sparse Mixture-of-Experts systems, highlighting the trade-offs between cost, accuracy, and performance, and providing tools for better deployment decisions across hardware platforms.

Contribution

The paper presents MoE-CAP, a new benchmark and metrics to evaluate the cost, accuracy, and performance trade-offs in MoE systems, addressing limitations of existing benchmarks.

Findings

01. Achieving optimal CAP balance is challenging with current hardware.

02. MoE systems tend to optimize two of the three CAP dimensions at the expense of the third.

03. The CAP Radar Diagram visualizes trade-offs effectively.

Abstract

The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics: Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model…
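
The motivation for a sparsity-aware bandwidth metric can be illustrated with a small calculation. The sketch below is not the paper's definition of S-MBU (the abstract is cut off before any definition appears); it assumes a simplified decode step at batch size 1 in which memory traffic consists only of the weights read, and every function name and number in it is hypothetical.

```python
# Illustrative only: a simplified bandwidth-utilization calculation.
# Assumption: memory traffic per decode step is just the weights read.
# All names and numbers here are hypothetical, not taken from the paper.


def weight_bytes_per_step(shared_params: float,
                          params_per_expert: float,
                          experts_counted: int,
                          bytes_per_param: int) -> float:
    """Bytes of weights read in one step, counting `experts_counted` experts."""
    return (shared_params + params_per_expert * experts_counted) * bytes_per_param


def bandwidth_utilization(bytes_per_step: float,
                          step_seconds: float,
                          peak_bytes_per_second: float) -> float:
    """Achieved memory bandwidth divided by the hardware peak."""
    achieved = bytes_per_step / step_seconds
    return achieved / peak_bytes_per_second


# Hypothetical MoE model: 8 experts in total, 2 activated per token.
SHARED_PARAMS = 2e9        # attention, embeddings, and other non-expert weights
PARAMS_PER_EXPERT = 5e9    # weights belonging to a single expert
TOTAL_EXPERTS = 8
ACTIVE_EXPERTS = 2
BYTES_PER_PARAM = 2        # 16-bit weights
STEP_SECONDS = 0.05        # measured time per output token (made up)
PEAK_BANDWIDTH = 1e12      # 1 TB/s device peak (made up)

# Accounting that assumes every expert is read on every step.
dense_util = bandwidth_utilization(
    weight_bytes_per_step(SHARED_PARAMS, PARAMS_PER_EXPERT,
                          TOTAL_EXPERTS, BYTES_PER_PARAM),
    STEP_SECONDS, PEAK_BANDWIDTH)

# Accounting that counts only the experts actually activated.
sparse_util = bandwidth_utilization(
    weight_bytes_per_step(SHARED_PARAMS, PARAMS_PER_EXPERT,
                          ACTIVE_EXPERTS, BYTES_PER_PARAM),
    STEP_SECONDS, PEAK_BANDWIDTH)

print(f"all-experts accounting:    {dense_util:.2%}")
print(f"active-experts accounting: {sparse_util:.2%}")
```

With these made-up numbers, counting all experts yields a utilization above 100%, which is physically impossible, while counting only the activated experts stays below the hardware peak. This is one way to see why a metric that ignores sparsity can misreport how well an MoE system uses memory bandwidth.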

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning · Sparse and Compressive Sensing Techniques

Methods: Mixture of Experts