Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism
Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, Alexander Long

TL;DR
This paper introduces a novel compression algorithm for decentralized training of large models, significantly reducing communication costs while maintaining convergence, enabling training on low-end hardware over slow internet connections.
Contribution
It presents a new compression method for model parallelism that compresses activations and gradients with minimal overhead, facilitating scalable decentralized training.
Findings
Achieves up to 99% compression without convergence loss
Enables training billion-parameter models on low-end GPUs
Provides up to 100x communication efficiency improvement
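The 99% and 100x figures above agree with each other as plain arithmetic: a payload shrunk by 99% is one hundredth of its original size. The short Python sketch below shows that relationship only. The function name `reduction_factor` is made up for illustration and does not come from the paper, which may measure its efficiency gain differently.

```python
def reduction_factor(fraction_removed: float) -> float:
    """Return how many times smaller a payload is after removing `fraction_removed` of it."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


# Floating-point results are approximate, so compare with a tolerance.
factor_99 = reduction_factor(0.99)  # roughly 100
factor_90 = reduction_factor(0.90)  # roughly 10
```

The relationship is steep near the top of the range: moving from 90% to 99% compression raises the reduction factor from about 10 to about 100.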
Abstract
Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel training, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel training requires compressing activations and activation gradients as they propagate through layers, so compression errors accumulate. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation and negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method…
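The idea of confining activations to a predefined low-dimensional subspace can be illustrated with a generic linear-algebra sketch. This is not the authors' algorithm: it only shows why a vector that already lies in a known subspace can be sent as a few coefficients and rebuilt exactly on the other side. All names (`orthonormal_basis`, `compress`, `decompress`, `DIM`, `RANK`) and the sizes used are assumptions chosen for illustration.

```python
import math
import random


def orthonormal_basis(dim, rank, seed=0):
    """Build `rank` orthonormal vectors of length `dim` via Gram-Schmidt on random vectors."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis


def compress(vector, basis):
    """Project a `dim`-length vector onto the basis: returns `rank` coefficients."""
    return [sum(a * b for a, b in zip(vector, vec)) for vec in basis]


def decompress(coeffs, basis):
    """Rebuild a `dim`-length vector from its subspace coefficients."""
    out = [0.0] * len(basis[0])
    for c, vec in zip(coeffs, basis):
        for i, x in enumerate(vec):
            out[i] += c * x
    return out


DIM, RANK = 64, 4  # hypothetical sizes; both ends share the same basis
basis = orthonormal_basis(DIM, RANK)
rng = random.Random(1)

# Case 1: an "activation" that lies inside the shared subspace.
activation = decompress([rng.uniform(-1, 1) for _ in range(RANK)], basis)
sent = compress(activation, basis)  # only RANK numbers cross the network
restored = decompress(sent, basis)
max_err = max(abs(a - r) for a, r in zip(activation, restored))

# Case 2: an arbitrary vector outside the subspace is not recovered.
arbitrary = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
lossy = decompress(compress(arbitrary, basis), basis)
off_err = max(abs(a - r) for a, r in zip(arbitrary, lossy))
```

In this toy setup only `RANK` of `DIM` values are transmitted, and reconstruction is exact up to floating-point error when the vector lies in the subspace. A vector outside the subspace loses information, which is why the constraint on where activations and gradients live matters.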
Taxonomy
Topics: Machine Learning and Algorithms · Distributed systems and fault tolerance · DNA and Biological Computing
