Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness
Eric Zelikman, Qian Huang, Percy Liang, Nick Haber, Noah D. Goodman

TL;DR
This note introduces a low-bandwidth method for decentralized language model fine-tuning. Machines share random seeds and exchange only single-byte projected gradients, which cuts communication costs and may offer privacy benefits.
Contribution
It extends the memory-efficient SPSA fine-tuning of Malladi et al. (2023) to a decentralized setting using shared randomness, enabling highly communication-efficient and potentially privacy-preserving model updates.
Findings
Significantly reduces communication bandwidth for distributed training.
Supports dynamic addition/removal of machines during training.
Retains the memory efficiency of SPSA, which needs only forward passes (inference) rather than backpropagation.
Abstract
Language model training in distributed settings is limited by the communication cost of gradient exchanges. In this short note, we extend recent work from Malladi et al. (2023), using shared randomness to perform distributed fine-tuning with low bandwidth. The method is a natural decentralized extension of memory-efficient Simultaneous Perturbation Stochastic Approximation (SPSA). Each iteration, each machine seeds a Random Number Generator (RNG) to perform local reproducible perturbations on model weights and calculate and exchange scalar projected gradients, which are then used to update each model. By using a (machine, sample) identifier as the random seed, each model can regenerate one another's perturbations. As machines only exchange single-byte projected gradients, this is highly communication efficient. There are also potential privacy benefits, as projected gradients may be…
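The scheme described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes a small linear-regression model in place of a language model, a fixed-scale signed 8-bit quantizer for the projected gradient, and a seed built from the (machine, sample) pair. The names `perturbation`, `encode`, `decode`, and `run`, and all hyperparameters, are hypothetical and chosen for illustration only.

```python
import struct
import numpy as np

DIM, MACHINES, STEPS = 8, 4, 300
EPS, LR, SCALE = 1e-3, 0.01, 0.1  # illustrative hyperparameters (assumptions)


def perturbation(machine, sample):
    """Regenerate the perturbation z from the shared (machine, sample) seed."""
    return np.random.default_rng([machine, sample]).standard_normal(DIM)


def loss(w, x, y):
    """Squared error of a toy linear model (stand-in for a language model loss)."""
    return 0.5 * (x @ w - y) ** 2


def encode(g):
    """Quantize a scalar projected gradient to one signed byte (assumed scheme)."""
    q = int(np.clip(np.round(g / SCALE), -127, 127))
    return struct.pack("b", q)


def decode(payload):
    """Recover the approximate projected gradient from its one-byte payload."""
    return struct.unpack("b", payload)[0] * SCALE


def run():
    data_rng = np.random.default_rng(0)
    w_true = data_rng.standard_normal(DIM)
    # Every machine starts from the same weights and keeps its own replica.
    replicas = [np.zeros(DIM) for _ in range(MACHINES)]
    initial_err = float(np.sum((replicas[0] - w_true) ** 2))

    for sample in range(STEPS):
        messages = []
        for m in range(MACHINES):
            # Local data never leaves the machine.
            x = data_rng.standard_normal(DIM)
            y = x @ w_true
            z = perturbation(m, sample)
            w = replicas[m]
            # SPSA: two forward passes give the gradient projected onto z.
            g = (loss(w + EPS * z, x, y) - loss(w - EPS * z, x, y)) / (2 * EPS)
            messages.append((m, sample, encode(g)))  # one byte per gradient

        # Each machine rebuilds every peer's z from the seed and applies the update.
        for k in range(MACHINES):
            for m, s, payload in messages:
                step = (LR / MACHINES) * decode(payload) * perturbation(m, s)
                replicas[k] = replicas[k] - step

    final_err = float(np.sum((replicas[0] - w_true) ** 2))
    return initial_err, final_err, replicas


if __name__ == "__main__":
    initial_err, final_err, replicas = run()
    print("squared error before:", initial_err, "after:", final_err)
    print("replicas identical:", all(np.array_equal(replicas[0], r) for r in replicas))
```

Because every replica applies the same decoded scalars to the same regenerated perturbations in the same order, the replicas stay in sync without ever transmitting weights or full gradients. In this sketch, only the one-byte payload and the (machine, sample) identifier cross the network.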
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Privacy-Preserving Technologies in Data
