Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Sameera Ramasinghe; Thalaiyasingam Ajanthan; Gil Avraham; Yan Zuo; Alexander Long

arXiv:2506.01260 · cs.LG · June 3, 2025

TL;DR

This paper introduces a compression algorithm for model-parallel decentralized training that cuts communication by up to 99% without degrading convergence, making it feasible to train large models on low-end hardware over slow internet connections.

Contribution

The paper presents a compression method for model parallelism that compresses activations in the forward pass and activation gradients in the backward pass with negligible memory and compute overhead, enabling scalable decentralized training.

Findings

01  Achieves up to 99% compression without convergence loss

02  Enables training billion-parameter models on low-end GPUs

03  Provides up to 100x communication efficiency improvement
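
The 99% and 100x figures above are consistent with each other under a simple assumption: if a fraction of the traffic is removed and nothing else changes, the reduction factor is the original volume divided by what remains. The short sketch below illustrates that arithmetic only; the function name is hypothetical and the page does not state how the 100x figure was measured.

```python
def reduction_factor(percent_compressed):
    """Return how many times smaller the traffic becomes.

    Assumes (for illustration) that 'percent_compressed' is the share of
    bytes removed, so the remaining share is 100 - percent_compressed.
    """
    remaining = 100 - percent_compressed
    if remaining <= 0:
        raise ValueError("compression must be below 100%")
    return 100 / remaining


# 99% of the bytes removed leaves 1%, i.e. a 100-fold reduction.
factor_99 = reduction_factor(99)
# 50% removed leaves half, i.e. a 2-fold reduction.
factor_50 = reduction_factor(50)
```

Under this reading, each additional percentage point near the top of the range matters a great deal: going from 98% to 99% compression doubles the reduction factor from 50 to 100.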

Abstract

Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel training, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel training requires compressing activations and activation gradients as they propagate through layers, which accumulates compression errors. We propose a novel compression algorithm that compresses both the forward and backward passes, enabling up to 99% compression with no convergence degradation and negligible memory and compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method…
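
The abstract's idea of confining activations to a predefined low-dimensional subspace can be illustrated with a toy projection. The sketch below is not the paper's algorithm: it assumes a fixed orthonormal basis shared by sender and receiver (the name `BASIS` and the sizes d = 4, k = 2 are hypothetical), sends only the k coefficients, and rebuilds the d-dimensional vector on the other side. Reconstruction is exact only when the vector already lies in the subspace, which is why confining activations to it matters.

```python
import math


def project(basis, x):
    """Compress: return the k coefficients of x in the shared basis (U^T x)."""
    return [sum(u_i * x_i for u_i, x_i in zip(u, x)) for u in basis]


def reconstruct(basis, coeffs):
    """Decompress: rebuild a d-dimensional vector from k coefficients (U c)."""
    d = len(basis[0])
    return [sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(d)]


def max_abs_error(a, b):
    """Largest element-wise difference between two vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


# Hypothetical shared basis: k = 2 orthonormal vectors in d = 4 dimensions.
s = 1 / math.sqrt(2)
BASIS = [
    [s, s, 0.0, 0.0],
    [0.0, 0.0, s, -s],
]

# A vector built from the basis lies in the subspace by construction.
inside = reconstruct(BASIS, [3.0, -1.0])
sent_inside = project(BASIS, inside)          # only k = 2 numbers are sent
err_inside = max_abs_error(inside, reconstruct(BASIS, sent_inside))

# A vector outside the subspace loses information when projected.
outside = [1.0, 0.0, 0.0, 0.0]
sent_outside = project(BASIS, outside)
err_outside = max_abs_error(outside, reconstruct(BASIS, sent_outside))

# Share of values saved per vector in this toy setting: 1 - k/d.
saved = 1 - len(BASIS) / len(BASIS[0])
```

In this toy setting half of the values are saved per vector; the figure of up to 99% quoted in the abstract would correspond to a far smaller k relative to d. How the paper chooses the subspace and keeps activations and gradients inside it is not described in the truncated abstract above.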

Taxonomy

Topics: Machine Learning and Algorithms · Distributed systems and fault tolerance · DNA and Biological Computing