LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms
Zhihan Jiang, Rui Ren, Guangba Yu, Yulun Wu, Wenwei Gu, Yichen Li, Yujie Huang, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu

TL;DR
This paper introduces LLMPrism, a black-box system that uses network flow data to monitor LLM training jobs non-intrusively, reconstruct their training timelines, and diagnose performance issues on large-scale LLM training platforms, improving efficiency and resource utilization.
Contribution
LLMPrism is the first system to enable non-intrusive, accurate performance diagnosis for production LLM training platforms using network flow data.
Findings
Achieves timeline reconstruction error within 0.3%
Effectively diagnoses various performance issues
Deployed on a large-scale production platform since Oct. 2024
Abstract
Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and can result in substantial resource wastage. The limited visibility available to platform providers impedes existing profiling methods and makes it challenging to monitor and diagnose the performance of LLM training jobs. For the first time, this paper proposes using underlying network flow data to reconstruct the training timelines of jobs based on the distinct characteristics of the LLM training procedure. We design LLMPrism, the first black-box performance…
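To make the idea of recovering a training timeline from network flow data more concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes, hypothetically, that each flow record carries a start timestamp, that communication within one training iteration arrives as a burst, and that bursts are separated by idle gaps longer than a chosen threshold. The function name `split_into_iterations`, the `gap_threshold` parameter, and the sample timestamps are all made up for illustration.

```python
from typing import List, Tuple


def split_into_iterations(
    flow_starts: List[float], gap_threshold: float
) -> List[Tuple[float, float]]:
    """Group flow start times into bursts separated by idle gaps.

    Hypothetical helper: returns (first_ts, last_ts) for each burst.
    """
    if not flow_starts:
        return []
    ts = sorted(flow_starts)
    bursts = []
    burst_start = prev = ts[0]
    for t in ts[1:]:
        # A gap longer than the threshold closes the current burst.
        if t - prev > gap_threshold:
            bursts.append((burst_start, prev))
            burst_start = t
        prev = t
    bursts.append((burst_start, prev))
    return bursts


# Made-up timestamps in seconds: three bursts roughly 10 s apart.
flows = [0.0, 0.2, 0.5, 10.1, 10.3, 10.4, 20.0, 20.2]
bursts = split_into_iterations(flows, gap_threshold=2.0)

# Estimated iteration time: spacing between consecutive burst starts.
iteration_times = [b[0] - a[0] for a, b in zip(bursts, bursts[1:])]
```

In this toy setting, an iteration whose estimated time is much longer than its neighbors would be a candidate for closer inspection. A real platform would likely need to separate jobs and communication types first, which this sketch does not attempt.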
