SysOM-AI: Continuous Cross-Layer Performance Diagnosis for Production AI Training
Yusheng Zheng, Wenan Mao, Shuyi Cheng, Fuqiu Feng, Guangshui Li, Zhaoyan Liao, Yongzhuo Huang, Zhenwei Xiao, Yuqing Li, Andi Quinn, Tao Ma

TL;DR
SysOM-AI is a continuous, low-overhead, cross-layer observability system for diagnosing performance issues in large-scale AI training. In production it cut median diagnosis time from days to about 10 minutes.
Contribution
It introduces a production system that combines OS-level instrumentation (adaptive hybrid stack unwinding and eBPF-based tracing) with layered differential diagnosis to troubleshoot performance continuously in production AI training.
Findings
Diagnosed 94 production issues at Alibaba over more than a year of deployment.
Reduced median diagnosis time from days to about 10 minutes.
Achieved less than 0.4% overhead in continuous monitoring.
Abstract
Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10–30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped…
