From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Daemyung Kang; Eunjin Hwang; Hanjeong Lee; HyeokJin Kim; Hyunhoi Koo; Jeongkyu Shin; Jeongseok Kang; Jihyun Kang; Joongi Kim; Junbum Lee; Jungseung Yang; Kyujin Cho; Youngsook Song

arXiv:2605.09370 · cs.DC · May 12, 2026

TL;DR

This paper provides an empirical analysis of large-scale distributed AI training on a 504-GPU cluster, focusing on failure detection, diagnosis, and recovery strategies, based on 55 days of Prometheus time-series data and 73 days of operational logs.

Contribution

It offers a comprehensive operational study of production-scale AI training, highlighting failure detection, bottleneck diagnosis, and recovery improvements in a multi-party environment.

Findings

01

Achieved a 10/10 failure detection rate with low false positives.

02

Identified NFS RPC saturation as the cause of the bandwidth paradox on the path from GPU VRAM to storage.

03

Auto-retry recovery chains outperform manual recovery, with a success rate of 33.3%.

Abstract

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we…
