LLMPrism: Black-box Performance Diagnosis for Production LLM Training Platforms

Zhihan Jiang; Rui Ren; Guangba Yu; Yulun Wu; Wenwei Gu; Yichen Li; Yujie Huang; Cong Feng; Zengyin Yang; Yongqiang Yang; Michael R. Lyu

arXiv:2505.00342·cs.SE·May 2, 2025

TL;DR

This paper introduces LLMPrism, a black-box system that uses network flow data to non-intrusively monitor, reconstruct training timelines, and diagnose performance issues in large-scale LLM training platforms, improving efficiency and resource utilization.
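As a rough illustration of how communication timing alone can hint at a training timeline, the sketch below groups network flow timestamps into bursts separated by idle gaps and treats the spacing between burst starts as an estimate of iteration duration. This is not the paper's algorithm: the function names, the `gap_threshold` parameter, and the assumption that each burst corresponds to one training iteration are all hypothetical simplifications.

```python
from typing import List, Tuple


def split_into_bursts(
    timestamps: List[float], gap_threshold: float
) -> List[Tuple[float, float]]:
    """Group flow timestamps into (start, end) bursts.

    A new burst begins whenever the gap since the previous flow exceeds
    gap_threshold (hypothetical tuning parameter, in seconds).
    """
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    bursts = []
    start = prev = ordered[0]
    for t in ordered[1:]:
        if t - prev > gap_threshold:
            bursts.append((start, prev))
            start = t
        prev = t
    bursts.append((start, prev))
    return bursts


def iteration_durations(bursts: List[Tuple[float, float]]) -> List[float]:
    """Estimate iteration durations as the spacing between burst starts.

    Assumes (for illustration only) one communication burst per iteration.
    """
    return [nxt[0] - cur[0] for cur, nxt in zip(bursts, bursts[1:])]


# Synthetic flow timestamps in seconds, not real platform data.
flows = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.2]
bursts = split_into_bursts(flows, gap_threshold=1.0)
durations = iteration_durations(bursts)
```

An iteration whose estimated duration departs sharply from the others would be a candidate for closer inspection, though a production system would need far more than this heuristic.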

Contribution

LLMPrism is the first system to enable non-intrusive, accurate performance diagnosis for production LLM training platforms using network flow data.

Findings

01. Achieves a timeline reconstruction error within 0.3%

02. Effectively diagnoses various performance issues

03. Deployed on a large-scale production platform since Oct. 2024

Abstract

Large Language Models (LLMs) have brought about revolutionary changes in diverse fields, rendering LLM training of utmost importance for modern enterprises. To meet this demand, multi-tenant large-scale LLM training platforms have been built to offer LLM training services. Nevertheless, due to the complexity and synchronous nature of the LLM training process, performance issues occur frequently and can result in substantial resource wastage. The limited visibility from the perspective of platform providers impedes existing profiling methods and poses challenges to the monitoring and diagnosis of the performance of LLM training jobs. For the first time, this paper proposes the utilization of underlying network flow data to reconstruct the training timelines of jobs based on the distinct characteristics in the LLM training procedure. We design LLMPrism, the first black-box performance…
