Strix: Re-thinking NPU Reliability from a System Perspective
Jiapeng Guan, Jie Zhang, Hao Zhou, Ran Wei, Dean You, Hui Wang, Yingquan Wang, Tinglue Wang, Xudong Zhao, Jing Li, Zhe Jiang

TL;DR
Strix is a comprehensive NPU reliability framework that re-partitions the system, identifies failure modes, and applies targeted safeguards to improve fault tolerance with minimal overhead.
Contribution
It introduces a full-stack approach to NPU reliability, spanning micro-architecture, ISA, and programming, with a novel re-partitioning strategy and targeted safeguards.
Findings
Achieves sub-microsecond fault localisation and correction.
Imposes only 1.04× slowdown with minimal hardware overhead.
Addresses reliability gaps in existing coarse-grained approaches.
Abstract
DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware overhead.
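To make the contrast between coarse-grained replication and a targeted safeguard concrete, here is a minimal sketch of checksum-based matrix multiplication (classic algorithm-based fault tolerance). This is an illustrative stand-in, not Strix's mechanism: the abstract does not say which safeguards Strix uses, and the function names and the `inject` parameter are hypothetical. Integer inputs are assumed so that checksum comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def abft_matmul(a, b, inject=None):
    """Checksum-protected matmul (illustrative, not the paper's design).

    inject: optional hypothetical (row, col, delta) that simulates a single
    faulty output element in the data part of the result.
    Returns (product, position) where position is the corrected (row, col),
    or None if no single-element error was found.
    """
    m, n = len(a), len(b[0])
    # Extend a with a row of column sums and b with a column of row sums.
    a_ext = a + [[sum(col) for col in zip(*a)]]
    b_ext = [row + [sum(row)] for row in b]
    full = matmul(a_ext, b_ext)  # (m+1) x (n+1), last row/col are checksums

    if inject is not None:
        i, j, delta = inject
        full[i][j] += delta  # simulated hardware fault

    # A data row/column is suspect if it disagrees with its checksum.
    bad_rows = [i for i in range(m) if sum(full[i][:n]) != full[i][n]]
    bad_cols = [j for j in range(n)
                if sum(full[i][j] for i in range(m)) != full[m][j]]

    fixed = None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row mismatch equals the error magnitude; subtract it out.
        full[i][j] -= sum(full[i][:n]) - full[i][n]
        fixed = (i, j)

    return [row[:n] for row in full[:m]], fixed


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
clean, pos_clean = abft_matmul(a, b)
repaired, pos_fault = abft_matmul(a, b, inject=(0, 1, 7))
```

In this sketch a single corrupted output element is localised by the intersection of the mismatching row and column and then corrected in place. The extra work is one additional row and one additional column of the product, whereas duplicating the whole computation roughly doubles it.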
