CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations
Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

TL;DR
CRAFT is an expert replication framework for large-scale Mixture-of-Experts (MoE) language model serving that improves load balance and GPU memory utilization, raising throughput without retraining.
Contribution
CRAFT introduces fine-grained, layerwise expert replication driven by estimated replication benefits, and outperforms existing replication schemes under memory constraints.
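To make the idea of benefit-driven, layerwise replication concrete, here is a minimal sketch of a greedy allocator. It is not the authors' algorithm: the function names (`bottleneck`, `allocate_replicas`), the `min_gain` threshold, and the cost model (an expert's tokens split evenly across its replicas, with the benefit measured as the drop in the layer's largest per-replica load) are all assumptions made for illustration.

```python
def bottleneck(loads, replicas):
    """Largest per-replica load in one layer, assuming each expert's
    tokens are split evenly across its replicas (illustrative model)."""
    return max(load / r for load, r in zip(loads, replicas))


def allocate_replicas(layer_loads, budget, min_gain=0.0):
    """Greedily hand out up to `budget` extra replicas across layers.

    layer_loads: one list of per-expert token counts per layer.
    min_gain: hypothetical cutoff; stop once the best remaining replica
              reduces a layer's bottleneck by no more than this amount.
    Returns the replica count per expert, per layer.
    """
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        best = None  # (gain, layer index, expert index)
        for layer, loads in enumerate(layer_loads):
            counts = replicas[layer]
            # Candidate: the expert with the highest per-replica load.
            hot = max(range(len(loads)), key=lambda e: loads[e] / counts[e])
            before = bottleneck(loads, counts)
            counts[hot] += 1  # tentatively add one replica
            gain = before - bottleneck(loads, counts)
            counts[hot] -= 1  # undo the tentative change
            if best is None or gain > best[0]:
                best = (gain, layer, hot)
        if best is None or best[0] <= min_gain:
            break  # remaining replicas would give only marginal benefit
        replicas[best[1]][best[2]] += 1
    return replicas


# Layer 0 is skewed, layer 1 is already balanced.
layer_loads = [[100, 10, 10, 10], [30, 30, 30, 30]]
plan = allocate_replicas(layer_loads, budget=5, min_gain=10)
print(plan)
```

In this toy input only the skewed layer receives replicas, and allocation stops before the budget of five is spent because the third replica would lower the bottleneck by less than `min_gain`. A uniform per-layer scheme would have spent memory on the balanced layer as well.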
Findings
CRAFT increases serving throughput by 1.14x on average and by up to 1.2x.
CRAFT effectively balances load with minimal resource overhead.
Abstract
Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained,…
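The load imbalance described in the abstract, and the way a replica relieves it, can be illustrated with a small calculation. This is a sketch under stated assumptions rather than anything taken from the paper: the helper names (`device_loads`, `imbalance`), the token counts, and the even split of an expert's tokens across its copies are all hypothetical.

```python
from collections import defaultdict


def device_loads(expert_tokens, placement):
    """Tokens processed per device for one MoE layer.

    expert_tokens: expert id -> tokens routed to that expert.
    placement: expert id -> list of devices holding a copy of the expert.
    Assumes tokens for an expert are split evenly across its copies.
    Devices that host no expert do not appear in the result.
    """
    loads = defaultdict(float)
    for expert, tokens in expert_tokens.items():
        devices = placement[expert]
        for device in devices:
            loads[device] += tokens / len(devices)
    return dict(loads)


def imbalance(loads):
    """Ratio of the busiest device's load to the mean load (1.0 is ideal)."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean


# Four experts partitioned across two devices; expert 0 is hot.
expert_tokens = {0: 400, 1: 200, 2: 100, 3: 100}
partitioned = {0: [0], 1: [0], 2: [1], 3: [1]}
before = imbalance(device_loads(expert_tokens, partitioned))

# Replicate the hot expert onto device 1, costing one extra expert's memory.
replicated = {0: [0, 1], 1: [0], 2: [1], 3: [1]}
after = imbalance(device_loads(expert_tokens, replicated))

print(before, after)
```

Here a single replica of the hot expert is enough to balance the two devices. Any further replica would consume GPU memory without improving the ratio, which is the kind of over-replication the abstract argues against.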
