Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Shuqing Luo; Ye Han; Pingzhi Li; Jiayin Qin; Jie Peng; Yang (Katie) Zhao; Yu (Kevin) Cao; Tianlong Chen

arXiv:2603.07006·cs.AR·March 10, 2026

Open Access

TL;DR

Mozart is a co-designed algorithm-hardware framework that enhances the training efficiency of MoE-based large language models on 3.5D wafer-scale chiplet architectures by exploiting modularity and optimizing communication and computation.

Contribution

Mozart introduces an algorithm-hardware co-design framework built specifically for training MoE-based LLMs on 3.5D wafer-scale chiplet architectures.

Findings

01. Significant efficiency improvements in MoE model training.

02. Enhanced parallelization and resource utilization.

03. Effective communication strategies for chiplet architectures.

Abstract

Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates…
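To make the communication concern in the abstract concrete, the sketch below counts how many token-to-expert dispatches cross a chiplet boundary under two expert placements. It is a toy illustration only, not Mozart's allocation strategy: the function name `cross_chiplet_traffic`, the two-chiplet layout, the top-2 routing table, and both placements are assumptions made up for this example.

```python
from collections import Counter

def cross_chiplet_traffic(token_home, token_experts, expert_chiplet):
    """Count token-to-expert dispatches that leave the token's home chiplet.

    token_home[i]    : chiplet holding token i (hypothetical layout)
    token_experts[i] : experts selected for token i by the router
    expert_chiplet   : mapping from expert id to the chiplet hosting it
    """
    traffic = Counter()
    for home, experts in zip(token_home, token_experts):
        for expert in experts:
            dst = expert_chiplet[expert]
            if dst != home:
                # This dispatch contributes to the all-to-all exchange.
                traffic[(home, dst)] += 1
    return traffic

# Hypothetical toy setup: 4 tokens, 4 experts, 2 chiplets, top-2 routing.
token_home = [0, 0, 1, 1]
token_experts = [(0, 1), (0, 1), (2, 3), (2, 3)]

# Placement A keeps co-activated experts on the same chiplet as their tokens.
placement_a = {0: 0, 1: 0, 2: 1, 3: 1}
# Placement B splits each co-activated pair across the two chiplets.
placement_b = {0: 0, 1: 1, 2: 0, 3: 1}

total_a = sum(cross_chiplet_traffic(token_home, token_experts, placement_a).values())
total_b = sum(cross_chiplet_traffic(token_home, token_experts, placement_b).values())
print(total_a, total_b)  # 0 4
```

In this toy case, the same routing decisions produce no cross-chiplet dispatches under placement A and four under placement B, which is the kind of sensitivity to placement that an expert allocation strategy would aim to exploit.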

Taxonomy

Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques