The Quest for Reliable AI Accelerators: Cross-Layer Evaluation and Design Optimization
Meng Li, Tong Xie, Zuodong Zhang, Runsheng Wang

TL;DR
This paper introduces cross-layer analysis tools and design strategies to improve the reliability and efficiency of AI accelerators facing nanoscale aging and variation effects.
Contribution
It presents a systematic framework for reliability-aware AI accelerator design, integrating cross-layer modeling and workload-specific optimizations.
Findings
Developed aging and variation-aware timing analysis tools
Optimized dataflow for critical input pattern reduction
Designed resilient architectures for large language models
Abstract
As CMOS technology scales to the nanoscale, aging effects and process variations have become increasingly pronounced, posing significant reliability challenges for AI accelerators. Traditional guardband-based design approaches, which rely on pessimistic timing margins, sacrifice significant performance and computational efficiency, rendering them inadequate for high-performance AI computing demands. Current reliability-aware AI accelerator design faces two core challenges: (1) the lack of systematic cross-layer analysis tools to capture coupled reliability effects across the device, circuit, architecture, and application layers; and (2) the fundamental trade-off between conventional reliability optimization and computational efficiency. To address these challenges, this paper systematically presents a series of reliability-aware accelerator designs, encompassing (1) aging and…
Taxonomy
Topics: Semiconductor materials and devices · Radiation Effects in Electronics · Low-power high-performance VLSI design
