Accelerating Compound LLM Training Workloads with Maestro
Xiulong Yuan, Hongqing Chen, Jiaxuan Peng, Fan Zhou, Zhixiang Ruan, Zekun Wang, Bo Zheng, Rui Men, Haiquan Wang, Zhipeng Zhang, Langshi Chen, Man Yuan, Jiaqi Gao, Zhengping Qian, Junyang Lin, Yong Li, Wei Lin, Junhua Wang, Jingren Zhou

TL;DR
Maestro is a training framework for compound LLM workloads whose components are heterogeneous and whose runtime load depends on the input. It improves GPU utilization and reduces GPU consumption by approximately 40% in key workloads.
Contribution
Maestro introduces a section-centric approach with dynamic scheduling and independent configuration, addressing static and runtime heterogeneity in compound LLM training.
Findings
Reduces GPU consumption by approximately 40% in key workloads.
Effectively handles heterogeneous components with different parallelism and execution modes.
Improves hardware utilization through dynamic input reordering and concurrent execution.
Abstract
Compound LLM training workloads, such as knowledge distillation and multimodal LLM (MLLM) training, are gaining prominence. These typically comprise heterogeneous components differing in parameter scale, execution mode (forward-only or full forward-backward), and sequence length. In addition, component activation can be data-dependent: in MLLM training, modality-specific parts activate only when inputs contain the corresponding modalities, causing dynamic computational paths and irregular runtime workloads. Conventional frameworks, designed for monolithic models, cannot handle this dual heterogeneity: static (across components) and dynamic (at runtime). By enforcing one-size-fits-all training configurations across components and ignoring input-induced variations, they suffer from suboptimal throughput and poor GPU utilization. In this paper, we introduce Maestro, a section-centric training framework that…
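
The data-dependent activation described in the abstract can be illustrated with a toy sketch. This is not Maestro's implementation, and it does not model its scheduling. It only shows, under assumed names, how the work routed to each component can change from batch to batch depending on which modalities the inputs contain. The `Sample` fields, the component names (`llm_backbone`, `vision_encoder`, `audio_encoder`), and both functions are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; a count of 0 means the modality is absent."""
    text_tokens: int
    image_patches: int = 0
    audio_frames: int = 0


def active_components(sample):
    """Return the (hypothetical) components that this sample would activate."""
    active = ["llm_backbone"]  # assumed to run for every sample
    if sample.image_patches > 0:
        active.append("vision_encoder")
    if sample.audio_frames > 0:
        active.append("audio_encoder")
    return active


def component_load(batch):
    """Sum the input units routed to each component for one batch."""
    load = {"llm_backbone": 0, "vision_encoder": 0, "audio_encoder": 0}
    for sample in batch:
        load["llm_backbone"] += sample.text_tokens
        load["vision_encoder"] += sample.image_patches
        load["audio_encoder"] += sample.audio_frames
    return load


# Two batches of the same size produce very different per-component work.
text_only_batch = [Sample(text_tokens=128), Sample(text_tokens=256)]
mixed_batch = [
    Sample(text_tokens=64, image_patches=576),
    Sample(text_tokens=32, audio_frames=300),
]

print(component_load(text_only_batch))
print(component_load(mixed_batch))
```

In the text-only batch the two encoders receive no work at all, while in the mixed batch they carry most of it. A framework that applies one fixed configuration to every component cannot adapt to this kind of variation, which is the runtime heterogeneity the abstract refers to.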
