Towards Efficient and Accurate Detection of On-Chip Fail-Slow Failures for Many-Core Accelerators
Junchi Wu, Xinfei Wan, Zhuoran Li, Yuyang Jin, Guangyu Sun, Yun Liang, Diyu Zhou, Youwei Zhuo

TL;DR
This paper presents SLOTH, a lightweight, hardware-aware framework for on-chip fail-slow failure detection in many-core accelerators. It reduces storage overhead by 115.9× on average while reaching 86.77% detection accuracy.
Contribution
SLOTH introduces a novel combination of workload-aware instrumentation, trace compression, and topology-aware ranking for practical fail-slow detection on-chip.
Findings
SLOTH reduces storage overhead by 115.9× on average.
SLOTH achieves 86.77% detection accuracy with a 12.11% false positive rate (FPR).
SLOTH scales effectively across various many-core architectures.
Abstract
Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SLOTH, a lightweight, hardware-aware framework for practical on-chip fail-slow detection in many-core accelerators. SLOTH combines workload-aware instrumentation for operator-level monitoring with minimal overhead, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SLOTH on a wide range of representative DNN workloads. The results demonstrate that SLOTH reduces the storage overhead by an average of 115.9×, while…
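To make the idea of operator-level monitoring in a small, fixed memory budget concrete, here is a minimal sketch. It is not SLOTH's compression scheme, which the abstract does not describe. It swaps in a standard technique, Welford's running mean and variance, so that each (core, operator) pair keeps three numbers instead of a full trace. The names `RunningStat`, `summarize`, and `slow_cores`, the event tuple layout, and the `factor=1.5` threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


class RunningStat:
    """Constant-size latency summary (count, mean, M2) via Welford's method."""

    __slots__ = ("n", "mean", "m2")

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two samples have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(events):
    """Fold (core, operator, latency) events into per-(core, operator) summaries.

    The event layout is hypothetical; raw events are discarded after the update.
    """
    stats = defaultdict(RunningStat)
    for core, op, latency in events:
        stats[(core, op)].update(latency)
    return stats


def slow_cores(stats, op, factor=1.5):
    """Return cores whose mean latency for `op` exceeds factor x the median core.

    The median baseline and the factor are illustrative assumptions only.
    """
    means = {core: s.mean for (core, o), s in stats.items() if o == op}
    if not means:
        return []
    baseline = median(means.values())
    return sorted(core for core, m in means.items() if m > factor * baseline)


# Hypothetical trace: cores 0-2 run "matmul" in 10 time units, core 3 in 30.
events = [(core, "matmul", 10.0) for core in range(3) for _ in range(3)]
events += [(3, "matmul", 30.0)] * 3
stats = summarize(events)
suspects = slow_cores(stats, "matmul")
```

This sketch only flags cores that look slow in isolation. It does not model how a slowdown propagates across the hardware topology, which is the part the abstract attributes to SLOTH's topology-aware ranking.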
