FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance
Yujia Zhai, Elisabeth Giem, Quan Fan, Kai Zhao, Jinyang Liu, Zizhong Chen

TL;DR
FT-BLAS is a high-performance BLAS library that incorporates online fault tolerance, maintaining competitive speed while detecting and correcting soft errors during linear algebra computations.
Contribution
This work introduces FT-BLAS, the first BLAS implementation with integrated online fault tolerance that achieves high performance through assembly optimization and kernel fusion.
Findings
FT-BLAS outperforms Intel MKL, OpenBLAS, and BLIS in speed.
FT-BLAS maintains accuracy under hundreds of injected errors per minute.
FT-BLAS achieves high reliability with minimal performance overhead.
Abstract
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an Algorithm-Based Fault Tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing…
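The hybrid strategy described in the abstract can be illustrated with a minimal sketch. It is not the FT-BLAS implementation, which the abstract says relies on assembly-level optimization and kernel fusion. The function names `scal_dmr` and `gemm_abft`, the `mul` and `fault` hooks used to simulate soft errors, and the tolerance `tol` are all hypothetical, chosen only for illustration. The first function duplicates each multiplication of a Level-1 style scaling routine and recomputes on a mismatch. The second applies a standard checksum scheme to a matrix product and corrects a single corrupted element.

```python
def scal_dmr(alpha, x, mul=lambda p, q: p * q):
    """Scale x by alpha, duplicating each multiply and comparing the results.

    `mul` is a hypothetical hook that lets a caller simulate a faulty multiplier.
    """
    out = []
    for xi in x:
        first, second = mul(alpha, xi), mul(alpha, xi)
        while first != second:  # a mismatch signals a soft error: recompute
            first, second = mul(alpha, xi), mul(alpha, xi)
        out.append(first)
    return out


def gemm_abft(a, b, fault=None, tol=1e-9):
    """Compute C = A * B with checksum-based detection of one corrupted element.

    `fault` is a hypothetical (row, col, delta) triple used to inject an error.
    Returns the (possibly corrected) product and the number of corrections.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Checksums of the inputs: column sums of A and row sums of B.
    col_sum_a = [sum(a[i][p] for i in range(m)) for p in range(k)]
    row_sum_b = [sum(b[p][j] for j in range(n)) for p in range(k)]

    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    if fault is not None:  # simulated soft error in the output
        i, j, delta = fault
        c[i][j] += delta

    # Expected row and column sums of C, derived from the input checksums.
    exp_row = [sum(a[i][p] * row_sum_b[p] for p in range(k)) for i in range(m)]
    exp_col = [sum(col_sum_a[p] * b[p][j] for p in range(k)) for j in range(n)]

    bad_rows = [i for i in range(m) if abs(sum(c[i]) - exp_row[i]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c[i][j] for i in range(m)) - exp_col[j]) > tol]

    corrected = 0
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= sum(c[i]) - exp_row[i]  # subtract the row-sum discrepancy
        corrected = 1
    return c, corrected


if __name__ == "__main__":
    print(scal_dmr(2, [1, 2, 3]))
    print(gemm_abft([[1, 2], [3, 4]], [[5, 6], [7, 8]], fault=(0, 1, 100.0)))
```

The split mirrors the reasoning in the abstract: duplicating arithmetic adds little cost to memory-bound routines, while checksums add only lower-order work to a computing-bound matrix product. The sketch handles one corrupted element per product and uses a fixed tolerance; a real implementation would need a tolerance that accounts for floating-point rounding.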
