Grid-like Error-Correcting Codes for Matrix Multiplication with Better Correcting Capability

Hao Shi, Zhengyi Jiang, Zhongyi Huang, Bo Bai, Gong Zhang, and Hanxu Hou

arXiv:2508.04355·cs.IT·August 7, 2025

TL;DR

This paper presents a grid-like error-correcting code for matrix multiplication that detects and corrects multiple errors, improving fault tolerance in distributed deep learning training at a reported computational overhead of 24% on GPU architectures.

Contribution

A novel grid-based error-correcting coding framework specifically designed for matrix multiplication, enhancing error correction capabilities and fault tolerance in large-scale computations.

Findings

01. Deterministic correction of up to two errors across three matrices

02. Achieves 100% reliability in error correction

03. Only 24% computational overhead on GPU architectures

Abstract

Matrix multiplication over the real field constitutes a foundational operation in the training of deep learning models, serving as a computational cornerstone for both forward and backward propagation processes. However, the presence of silent data corruption (SDC) in large-scale distributed training environments poses a significant threat to model convergence and predictive accuracy, particularly when such errors manifest during matrix multiplication. Due to their transient and non-intrusive nature, these errors often evade detection, allowing them to propagate and accumulate over time, ultimately leading to substantial degradation in model performance. In this paper, we introduce a novel error-correcting coding framework specifically tailored for matrix multiplication operations. Our proposed framework is designed to detect and correct multiple computational errors that may arise…
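
For readers new to coded matrix multiplication, the following minimal sketch illustrates the general idea of detecting and correcting an error in a product using checksums. It implements the classic row/column checksum scheme from algorithm-based fault tolerance, which corrects a single corrupted entry. It is not the grid-like code proposed in the paper, whose construction is not described on this page. All function names (`matmul`, `encode`, `correct_single_error`) and the tolerance `tol` are hypothetical choices made for illustration.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode(a, b):
    """Append a column-sum row to A and a row-sum column to B.

    The product of the encoded matrices then carries the row sums of
    C = A @ B in its last column and the column sums in its last row.
    """
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    return a_c, b_r


def correct_single_error(c_full, tol=1e-9):
    """Locate and repair at most one corrupted data entry, in place.

    Returns the (row, col) position that was repaired, or None if all
    checksums agree. Assumption: the checksum entries themselves are intact.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    # Difference between each recomputed sum and the stored checksum.
    row_diff = [sum(c_full[i][:n]) - c_full[i][n] for i in range(m)]
    col_diff = [sum(c_full[i][j] for i in range(m)) - c_full[m][j]
                for j in range(n)]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern exceeds what this simple scheme corrects")
    i, j = bad_rows[0], bad_cols[0]
    # The row mismatch equals the size of the injected error.
    c_full[i][j] -= row_diff[i]
    return i, j
```

To try it, compute `c_full = matmul(*encode(a, b))`, perturb one entry of the upper-left data block, and call `correct_single_error(c_full)`. The call reports the perturbed position and restores the original value. This simple scheme stops at one error, which is the kind of limitation that codes with better correcting capability aim to overcome.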