Accelerating Compound LLM Training Workloads with Maestro

Xiulong Yuan, Hongqing Chen, Jiaxuan Peng, Fan Zhou, Zhixiang Ruan, Zekun Wang, Bo Zheng, Rui Men, Haiquan Wang, Zhipeng Zhang, Langshi Chen, Man Yuan, Jiaqi Gao, Zhengping Qian, Junyang Lin, Yong Li, Wei Lin, Junhua Wang, Jingren Zhou

arXiv:2605.10501·cs.DC·May 12, 2026

TL;DR

Maestro is a training framework for compound LLM training that optimizes heterogeneous and dynamic workloads, improving GPU utilization and reducing GPU consumption by approximately 40% in key workloads.

Contribution

Maestro introduces a section-centric approach with dynamic scheduling and independent configuration, addressing static and runtime heterogeneity in compound LLM training.
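
To make "independent configuration" concrete, here is a minimal sketch of what per-section settings could look like for a knowledge distillation job. It is not Maestro's API: the `SectionConfig` class, its field names, and the parallelism degrees are all hypothetical, chosen only to show two components with different parallelism and execution modes living side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionConfig:
    """Hypothetical per-section settings; names and fields are illustrative only."""
    name: str
    tensor_parallel: int
    data_parallel: int
    trainable: bool  # False means forward-only, e.g. a frozen teacher model

    @property
    def gpus(self) -> int:
        # GPUs this section occupies under its own parallelism layout.
        return self.tensor_parallel * self.data_parallel


def total_gpus(sections):
    """Sum the GPU demand of independently configured sections."""
    return sum(s.gpus for s in sections)


# Assumed example values: a forward-only teacher and a fully trained student.
sections = [
    SectionConfig("teacher", tensor_parallel=4, data_parallel=1, trainable=False),
    SectionConfig("student", tensor_parallel=2, data_parallel=4, trainable=True),
]

for s in sections:
    mode = "forward-backward" if s.trainable else "forward-only"
    print(f"{s.name}: {s.gpus} GPUs, {mode}")
print("total:", total_gpus(sections))
```

A one-size-fits-all framework would force both components onto the same layout; keeping a separate record per section is one simple way to express that each can be sized for its own parameter scale and execution mode.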

Findings

01. Reduces GPU consumption by approximately 40% in key workloads.

02. Effectively handles heterogeneous components with different parallelism and execution modes.

03. Improves hardware utilization through dynamic input reordering and concurrent execution.

Abstract

Compound LLM training workloads, such as knowledge distillation and multimodal LLM (MLLM) training, are gaining prominence. These workloads typically comprise heterogeneous components that differ in parameter scale, execution mode (forward-only or full forward-backward), and sequence length. Moreover, component activation can be data-dependent: in MLLM training, modality-specific parts activate only when the inputs contain the corresponding modalities, which produces dynamic computational paths and irregular runtime workloads. Conventional frameworks, designed for monolithic models, cannot handle this dual heterogeneity: static (across components) and dynamic (at runtime). By enforcing one-size-fits-all training configurations across components and ignoring input-induced variations, they suffer from suboptimal throughput and poor GPU utilization. In this paper, we introduce Maestro, a section-centric training framework that…
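
The data-dependent activation described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not code from the paper: the component names (`vision_encoder`, `audio_encoder`, `llm_backbone`), the modality mapping, and the sample batches are all hypothetical. It only shows how the per-component workload shifts from batch to batch depending on which modalities the inputs contain.

```python
from collections import Counter

# Hypothetical mapping from an input modality to the component it activates.
COMPONENT_FOR_MODALITY = {"image": "vision_encoder", "audio": "audio_encoder"}


def active_components(sample_modalities):
    """Return the components one sample activates; the backbone always runs."""
    active = {"llm_backbone"}
    for modality in sample_modalities:
        if modality in COMPONENT_FOR_MODALITY:
            active.add(COMPONENT_FOR_MODALITY[modality])
    return active


def component_load(batch):
    """Count how many samples in a batch activate each component."""
    load = Counter()
    for sample in batch:
        load.update(active_components(sample))
    return load


# Two assumed batches of the same size but with different modality mixes.
batch_a = [{"text"}, {"text"}, {"text", "image"}]
batch_b = [{"text", "image"}, {"text", "image", "audio"}, {"text", "image"}]

print(dict(component_load(batch_a)))
print(dict(component_load(batch_b)))
```

Both batches hold three samples, yet the vision encoder processes one sample in the first and three in the second, and the audio encoder is idle in the first. A static configuration sized for either case leaves hardware underused in the other.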
