Cooperative Gradient Coding

Shudi Weng, Ming Xiao, Chao Ren, and Mikael Skoglund

arXiv:2507.05230·cs.DC·July 8, 2025

TL;DR

This paper introduces cooperative gradient coding (CoGC) and an enhanced decoding method, GC$^+$, for distributed training. Together they improve communication efficiency and reliability in federated learning under unreliable communication.

Contribution

The paper proposes CoGC, a gradient-sharing-based gradient coding framework, together with GC$^+$, a complementary decoding mechanism, and supports both with theoretical analysis and simulations showing improved robustness and efficiency.

Findings

01. CoGC eliminates dataset replication, reducing communication and computation costs.

02. GC$^+$ significantly improves system reliability by recovering information lost during decoding failures.

03. Theoretical bounds and extensive simulations validate the effectiveness of the proposed methods.

Abstract

This work studies gradient coding (GC) in the context of distributed training problems with unreliable communication. We propose cooperative GC (CoGC), a novel gradient-sharing-based GC framework that leverages cooperative communication among clients. This approach ultimately eliminates the need for dataset replication, making it both communication- and computation-efficient and suitable for federated learning (FL). By employing the standard GC decoding mechanism, CoGC yields strictly binary outcomes: either the global model is exactly recovered, or the decoding fails entirely, with no intermediate results. This characteristic ensures the optimality of the training and demonstrates strong resilience to client-to-server communication failures when the communication channels among clients are in good condition. However, it may also result in communication inefficiency and hinder…
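The all-or-nothing behaviour of standard GC decoding described above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes a simple fractional-repetition code, error-free client-to-client sharing, and hypothetical names (`fractional_repetition_matrix`, `decode`, `B`, `s`). Each client forms a coded sum from gradients shared by its peers, with no dataset replication, and the server either recovers the exact gradient sum or gets nothing.

```python
import numpy as np

def fractional_repetition_matrix(num_clients, s):
    """Hypothetical coding matrix B: row i holds the weights client i
    applies to the shared local gradients (groups of s + 1 clients)."""
    assert num_clients % (s + 1) == 0, "sketch assumes (s + 1) divides M"
    B = np.zeros((num_clients, num_clients))
    for i in range(num_clients):
        g = i // (s + 1)                      # group index of client i
        B[i, g * (s + 1):(g + 1) * (s + 1)] = 1.0
    return B

def decode(B, coded, received, tol=1e-8):
    """Standard GC decoding at the server: return the exact gradient sum
    if the all-ones vector lies in the span of the received rows of B,
    otherwise return None (decoding failure, no partial result)."""
    if not received:
        return None
    rows = B[received]                        # shape (r, M)
    ones = np.ones(B.shape[1])
    # Solve rows.T @ a = ones in the least-squares sense.
    a, *_ = np.linalg.lstsq(rows.T, ones, rcond=None)
    if np.linalg.norm(rows.T @ a - ones) > tol:
        return None                           # not decodable
    return a @ coded[received]                # exact sum of all gradients

rng = np.random.default_rng(0)
M, s, dim = 6, 1, 4                           # assumed toy sizes
grads = rng.normal(size=(M, dim))             # one local gradient per client
B = fractional_repetition_matrix(M, s)
coded = B @ grads                             # client i's coded message

ok = decode(B, coded, [0, 2, 5])              # one client per group arrived
fail = decode(B, coded, [0, 1, 2, 3])         # group {4, 5} lost entirely
```

In this toy setting `ok` equals `grads.sum(axis=0)` while `fail` is `None`, which mirrors the binary outcome: the messages from clients 0 to 3 carry useful partial sums, yet standard decoding discards them once exact recovery is impossible.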
