Strix: Re-thinking NPU Reliability from a System Perspective

Jiapeng Guan; Jie Zhang; Hao Zhou; Ran Wei; Dean You; Hui Wang; Yingquan Wang; Tinglue Wang; Xudong Zhao; Jing Li; Zhe Jiang

arXiv:2604.10484·cs.AR·April 14, 2026

TL;DR

Strix is a full-stack NPU reliability framework that re-partitions the NPU along the inference pipeline, identifies the dominant failure modes, and attaches targeted safeguards, improving fault tolerance at a 1.04× slowdown with minimal hardware overhead.

Contribution

Strix introduces a full-stack approach to NPU reliability, spanning micro-architecture, ISA, and programming methods, built on a re-partitioning of the NPU along the inference pipeline with targeted safeguards.

Findings

01

Achieves sub-microsecond fault localisation, error detection, and correction.

02

Imposes only a 1.04× slowdown with minimal hardware overhead.

03

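Addresses the gap left by existing system-level mechanisms, which treat the NPU as a monolithic unit and rely on coarse-grained replication with prohibitive performance and hardware overheads.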

Abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only a 1.04× slowdown and minimal hardware overhead.
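To put the reported slowdown in concrete terms, the sketch below converts a multiplicative slowdown factor into latency and percentage overhead. Only the 1.04× figure comes from the abstract. The 10 ms baseline latency and the 2.0× factor (an idealised model of re-running every inference twice in sequence, used here as a stand-in for coarse-grained replication) are illustrative assumptions, not measurements from the paper.

```python
def protected_latency(baseline_ms: float, slowdown: float) -> float:
    """Return the latency after applying a multiplicative slowdown factor."""
    return baseline_ms * slowdown


def overhead_percent(slowdown: float) -> float:
    """Return the extra time, as a percentage of the unprotected baseline."""
    return (slowdown - 1.0) * 100.0


BASELINE_MS = 10.0      # hypothetical unprotected inference latency
STRIX_SLOWDOWN = 1.04   # figure reported in the abstract
RERUN_SLOWDOWN = 2.0    # assumption: idealised back-to-back duplicate execution

if __name__ == "__main__":
    for label, factor in [("targeted safeguards", STRIX_SLOWDOWN),
                          ("duplicate execution (assumed)", RERUN_SLOWDOWN)]:
        latency = protected_latency(BASELINE_MS, factor)
        extra = overhead_percent(factor)
        print(f"{label}: {latency:.2f} ms ({extra:.0f}% overhead)")
```

Under these assumptions, a 1.04× slowdown adds roughly 4% to each inference, while naive duplicate execution would add about 100%.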
