Towards Efficient and Accurate Detection of On-Chip Fail-Slow Failures for Many-Core Accelerators

Junchi Wu; Xinfei Wan; Zhuoran Li; Yuyang Jin; Guangyu Sun; Yun Liang; Diyu Zhou; Youwei Zhuo

arXiv:2510.24112 · cs.AR · February 26, 2026

TL;DR

This paper presents SLOTH, a lightweight, hardware-aware framework for on-chip fail-slow failure detection in many-core accelerators. It reduces storage overhead by 115.9× on average while achieving 86.77% detection accuracy.

Contribution

SLOTH introduces a novel combination of workload-aware instrumentation, trace compression, and topology-aware ranking for practical fail-slow detection on-chip.

Findings

01. SLOTH reduces storage overhead by 115.9× on average.

02. SLOTH achieves 86.77% detection accuracy with a 12.11% false positive rate (FPR).

03. SLOTH scales across a range of many-core architectures.

Abstract

Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SLOTH, a lightweight, hardware-aware framework for practical on-chip fail-slow detection in many-core accelerators. SLOTH combines workload-aware instrumentation for operator-level monitoring with minimal overhead, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SLOTH on a wide range of representative DNN workloads. The results demonstrate that SLOTH reduces the storage overhead by an average of 115.9×, while…
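
The abstract does not spell out how SLOTH compresses traces or ranks cores, so the following is only a rough illustration of the general idea of keeping per-core latency traces within a small, fixed memory budget. It is not the authors' algorithm. It swaps in Welford's running-statistics method as a stand-in for trace compression and a simple median comparison as a stand-in for ranking; the names `RunningStats`, `summarize`, `suspect_cores`, and the `slowdown_threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunningStats:
    """Constant-size summary of a latency stream (Welford's algorithm).

    Assumption: this stands in for trace compression; it is not SLOTH's scheme.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, latency: float) -> None:
        self.count += 1
        delta = latency - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (latency - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until two samples are seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(trace):
    """Fold (core_id, latency) samples into one RunningStats per core."""
    stats = {}
    for core_id, latency in trace:
        stats.setdefault(core_id, RunningStats()).update(latency)
    return stats


def suspect_cores(stats, slowdown_threshold=1.5):
    """Return sorted core ids whose mean latency exceeds
    slowdown_threshold times the median of all per-core means.

    Assumption: a topology-unaware placeholder for a ranking step.
    """
    baseline = median(s.mean for s in stats.values())
    return sorted(
        core for core, s in stats.items()
        if s.mean > slowdown_threshold * baseline
    )


if __name__ == "__main__":
    # Hypothetical operator latencies (arbitrary units) on four cores.
    trace = [(0, 10.0), (1, 10.0), (2, 10.0), (3, 30.0)] * 5
    print(suspect_cores(summarize(trace)))
```

Each core's summary occupies three numbers regardless of how many samples arrive, which is the property that makes this kind of approach plausible under a kilobyte-scale budget. A real detector would also need to account for the hardware topology, which this sketch ignores.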
