Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures
Shuqing Luo, Ye Han, Pingzhi Li, Jiayin Qin, Jie Peng, Yang (Katie) Zhao, Yu (Kevin) Cao, Tianlong Chen

TL;DR
Mozart is a co-designed algorithm-hardware framework that enhances the training efficiency of MoE-based large language models on 3.5D wafer-scale chiplet architectures by exploiting modularity and optimizing communication and computation.
Contribution
The paper introduces an algorithm-hardware co-design framework that pairs algorithmic and architectural optimizations for training MoE-based LLMs on wafer-scale chiplet architectures.
Findings
Significant efficiency improvements in MoE model training.
Enhanced parallelization and resource utilization.
Effective communication strategies for chiplet architectures.
Abstract
Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates…
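As a rough illustration of why streaming tokens in smaller chunks can improve communication-computation overlap, the sketch below models a generic two-stage pipeline: each chunk is first transferred, then computed, and the transfer of one chunk overlaps the compute of the previous one. This is a toy analytic model and not Mozart's scheduler. The function name `pipelined_makespan`, its parameters, and the assumption of equal-sized chunks with no per-chunk overhead are all hypothetical choices made for this example.

```python
def pipelined_makespan(comm_total: float, comp_total: float, num_chunks: int) -> float:
    """Makespan of a two-stage (communicate, then compute) pipeline.

    Assumptions (hypothetical, for illustration only): the work splits into
    equal chunks, each stage handles one chunk at a time, and chunking adds
    no extra latency or launch overhead.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    comm = comm_total / num_chunks  # transfer time per chunk
    comp = comp_total / num_chunks  # compute time per chunk
    # The first chunk's transfer and the last chunk's compute cannot be hidden;
    # in between, the slower of the two stages sets the pace.
    return comm + (num_chunks - 1) * max(comm, comp) + comp


if __name__ == "__main__":
    # With 8 time units of communication and 8 of computation, finer chunks
    # shrink the exposed (non-overlapped) portion of the schedule.
    for n in (1, 2, 4, 8):
        print(n, pipelined_makespan(8.0, 8.0, n))
```

With a single chunk nothing overlaps and the makespan is the sum of both stages. As the chunk count grows, the makespan approaches the duration of the slower stage. Real systems pay a per-chunk cost that this model omits, so the best granularity is finite in practice.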
Taxonomy
Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques
