High dimensional theory of two-phase optimizers
Atish Agarwala

TL;DR
This paper analyzes LA-DiLoCo, a simple two-phase optimizer from the DiLoCo family, on a high-dimensional linear regression problem. It characterizes how the method trades signal against noise relative to SGD, how much extra noise multiple workers introduce and how hyperparameter choices can offset it, and how adding momentum affects the method.
Contribution
It provides a theoretical analysis of LA-DiLoCo, a two-phase optimizer, highlighting its noise characteristics, hyperparameter tuning, and acceleration potential with momentum.
Findings
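The one-worker variant, LA, offers a different signal-noise tradeoff than SGD, which is beneficial in many scenarios.
Multi-worker LA-DiLoCo generates more noise than the single-worker version, but an appropriate choice of hyperparameters can mitigate this.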
Stacking momentum operators can accelerate convergence, especially with Nesterov momentum.
Abstract
The trend towards larger training setups has brought a renewed interest in partially asynchronous two-phase optimizers which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single worker version, but that this additional noise generation can be ameliorated by appropriate choice of hyperparameters. We conclude with an analysis of SLA -- LA with momentum -- and show that…
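The two-phase structure described in the abstract (optimize locally, then synchronize across workers) can be illustrated with a minimal sketch on a noisy linear regression problem. This is not the paper's algorithm or experimental setup: it assumes LA denotes a Lookahead-style outer step that moves the shared weights a fraction `outer_lr` of the way toward the average of the workers' locally trained weights, which the text above does not spell out. All function names, parameters, and problem sizes here are hypothetical and chosen only for illustration.

```python
import random


def make_problem(n_samples, dim, noise_std, rng):
    """Synthetic regression data: y = x . w_star + Gaussian label noise."""
    w_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = sum(wi * xi for wi, xi in zip(w_star, x)) + rng.gauss(0.0, noise_std)
        data.append((x, y))
    return data, w_star


def mse(w, data):
    """Mean squared error of weights w on the dataset."""
    total = 0.0
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        total += err * err
    return total / len(data)


def sgd_step(w, x, y, lr):
    """One single-sample SGD step on the loss 0.5 * (w . x - y)^2."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]


def two_phase(data, dim, n_workers, inner_steps, outer_rounds,
              inner_lr, outer_lr, rng):
    """Assumed Lookahead-style two-phase loop (illustrative only).

    Phase 1: each worker copies the shared weights and runs
             `inner_steps` SGD steps on independently sampled data.
    Phase 2: the shared weights move a fraction `outer_lr` toward
             the average of the workers' final weights.
    """
    w = [0.0] * dim
    for _ in range(outer_rounds):
        deltas = []
        for _ in range(n_workers):
            local = list(w)
            for _ in range(inner_steps):
                x, y = rng.choice(data)
                local = sgd_step(local, x, y, inner_lr)
            deltas.append([l - g for l, g in zip(local, w)])
        avg = [sum(d[i] for d in deltas) / n_workers for i in range(dim)]
        w = [g + outer_lr * a for g, a in zip(w, avg)]
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    data, _ = make_problem(n_samples=200, dim=5, noise_std=0.1, rng=rng)
    for n_workers in (1, 4):
        w = two_phase(data, dim=5, n_workers=n_workers, inner_steps=5,
                      outer_rounds=200, inner_lr=0.05, outer_lr=0.5, rng=rng)
        print(n_workers, "worker(s): final MSE =", mse(w, data))
```

With `n_workers=1` and `outer_lr=1.0` the loop reduces to plain SGD, so `inner_steps` and `outer_lr` are the knobs that separate the two-phase method from its baseline in this sketch. The sketch only shows the structure of the update and does not reproduce the noise analysis in the paper.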
