CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

Adrian Zhao; Zhenkun Cai; Zhenyu Song; Lingfan Yu; Haozheng Fan; Jun Wu; Yida Wang; Nandita Vijaykumar

arXiv:2603.28768·cs.DC·April 1, 2026

TL;DR

CRAFT is an expert replication framework for serving large Mixture-of-Experts (MoE) language models. It improves load balance and GPU memory utilization, raising serving throughput by 1.14x on average (up to 1.2x) without retraining.

Contribution

CRAFT introduces fine-grained, layerwise expert replication guided by estimated replication benefit, outperforming existing replication schemes under a fixed memory budget.

Findings

01

CRAFT increases serving throughput by 1.14x on average.

02

It achieves up to 1.2x throughput improvement.

03

CRAFT effectively balances load with minimal resource overhead.

Abstract

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained,…
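
The idea of spending a limited replica budget only where it helps can be illustrated with a small sketch. This is not CRAFT's algorithm, whose details are cut off in the abstract above. It is a simplified greedy allocator under stated assumptions: each layer's cost is taken to be its hottest per-replica token load, a replicated expert's load is assumed to split evenly across its copies, and the budget is counted in replicas rather than bytes. The names `bottleneck`, `allocate_replicas`, and `min_gain` are hypothetical.

```python
def bottleneck(loads, replicas):
    """Largest per-replica load in one layer, assuming even splitting."""
    return max(load / r for load, r in zip(loads, replicas))


def allocate_replicas(layer_loads, budget, min_gain=0.0):
    """Greedily place up to `budget` extra replicas across layers.

    layer_loads: one list of per-expert token counts per layer.
    min_gain:    stop once the best available replica reduces a layer's
                 bottleneck by no more than this amount (assumed cutoff).
    Returns the replica count per expert, per layer.
    """
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        best = None  # (gain, layer index, expert index)
        for li, loads in enumerate(layer_loads):
            reps = replicas[li]
            before = bottleneck(loads, reps)
            # Candidate: one more copy of this layer's hottest expert.
            hot = max(range(len(loads)), key=lambda e: loads[e] / reps[e])
            reps[hot] += 1
            gain = before - bottleneck(loads, reps)
            reps[hot] -= 1  # undo the trial placement
            if best is None or gain > best[0]:
                best = (gain, li, hot)
        if best is None or best[0] <= min_gain:
            break  # remaining replicas would only cost memory
        _, li, hot = best
        replicas[li][hot] += 1
    return replicas


# Layer 0 is skewed toward one expert; layer 1 is already balanced.
loads = [[100, 10, 10, 10], [30, 30, 30, 30]]
plan = allocate_replicas(loads, budget=3)
print(plan)
```

In this toy input, every replica goes to the skewed layer and the balanced layer receives none, which is the kind of over-replication a uniform per-layer scheme would incur. Raising `min_gain` leaves part of the budget unspent once the marginal improvement becomes small.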
