MoFa: A Unified Performance Modeling Framework for LLM Pretraining

Lu Zhao; Rong Shi; Shaoqing Zhang; Shangchao Su; Ziqing Yin; Zhiyan Cui; Hongfeng Sun; Baoguo He; Yueqiang Chen; Liang Dong; Xiyuan Li; Lingbin Wang; Lijun Ma; Qiang Huang; Ting Liu; Chong Wang; Can Wei

arXiv:2511.09837·cs.DC·November 21, 2025

Open Access

TL;DR

MoFa is a comprehensive performance modeling framework for large language model pretraining that accounts for optimization strategies and fault tolerance, enabling accurate predictions and guiding system design.

Contribution

The paper introduces a unified modeling approach that integrates optimization features and fault tolerance considerations for LLM pretraining.

Findings

01. High prediction accuracy across various scenarios
02. Effective identification of performance bottlenecks
03. Guidance for system design and deployment

Abstract

The exponential growth in LLM scales, with parameters soaring from billions to trillions, has necessitated distributed pretraining across large clusters comprising thousands to tens of thousands of devices. While hybrid parallelization strategies enable such pretraining, the vast combinatorial strategy space introduces significant optimization challenges. Traditional manual tuning methods incur prohibitive trial-and-error costs, and existing performance modeling approaches exhibit critical limitations: they fail to comprehensively account for prevalent optimization features and ignore the substantial overhead imposed by essential fault tolerance mechanisms like checkpoint recovery in long-duration pretraining. To address these gaps, we propose MoFa, a novel pretraining performance modeling framework that unifies multi-dimensional optimization features and fault tolerance. MoFa…

Taxonomy

Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Software System Performance and Reliability