Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale
Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Jian Sha, Bingsheng He, Minyi Guo, Quan Chen

TL;DR
Flare is a comprehensive diagnostic framework for large-scale distributed LLM training that automatically detects anomalies and performance issues across the entire training stack, improving reliability on thousands of GPUs.
Contribution
Flare introduces a scalable, full-stack diagnostic system with automatic anomaly detection for large GPU clusters training LLMs, covering anomalies that existing, narrowly targeted tools miss.
Findings
Effective anomaly diagnosis across 6,000 GPUs
Continuous operation for over eight months
Significant improvements in pinpointing training deficiencies
Abstract
The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, operating such clusters is challenging because training anomalies occur frequently. Since existing diagnostic tools are narrowly tailored to specific issues, they leave gaps when anomalies span the entire training stack. In response, we introduce Flare, a diagnostic framework designed for distributed LLM training at scale. Flare integrates a lightweight tracing daemon for full-stack, backend-extensible tracing. It also features a diagnostic engine that automatically diagnoses anomalies, with a focus on performance regressions. Deployed across 6,000 GPUs and operated continuously for over eight months, Flare has demonstrated significant improvements in pinpointing deficiencies in real-world scenarios.
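To make the idea of diagnosing performance regressions concrete, the following is a minimal sketch and not Flare's actual engine, whose internals are not described on this page. It assumes a list of per-step training durations is already available from some tracer. The function name `flag_regressions`, the rolling-median baseline, and the `window` and `threshold` parameters are all hypothetical choices made for illustration.

```python
from statistics import median


def flag_regressions(step_times, window=5, threshold=1.5):
    """Return indices of steps that look like performance regressions.

    A step is flagged when its duration exceeds `threshold` times the
    median of the preceding `window` steps. This is an illustrative
    heuristic, not the method used by Flare.
    """
    flagged = []
    for i in range(window, len(step_times)):
        # Baseline is the median of the previous `window` step durations,
        # which is robust to a single earlier outlier.
        baseline = median(step_times[i - window:i])
        if step_times[i] > threshold * baseline:
            flagged.append(i)
    return flagged


# Hypothetical per-step durations in seconds; step 6 is a slow step.
step_times = [1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 2.40, 1.03, 1.00]
print(flag_regressions(step_times))  # prints [6]
```

A median baseline is used here so that one slow step does not inflate the reference value for the steps that follow it. A production system would need to handle warm-up steps, sustained slowdowns, and per-rank differences.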
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · Advanced Neural Network Applications · Brain Tumor Detection and Classification
