On the Outsized Importance of Learning Rates in Local Update Methods
Zachary Charles, Jakub Konečný

TL;DR
This paper analyzes local update methods in federated and meta-learning, revealing the critical role of learning rates in convergence and proposing a practical automatic decay method to improve performance.
Contribution
It provides a theoretical characterization of local update methods for quadratic objectives and introduces a new automatic learning rate decay technique.
Findings
In communication-limited settings, proper learning rate tuning is often sufficient to reach near-optimal behavior.
The client learning rate controls both the condition number of the surrogate loss and the distance between the minimizers of the surrogate and true losses.
The proposed automatic learning rate decay improves empirical performance across various tasks.
Abstract
We study a family of algorithms, which we refer to as local update methods, that generalize many federated learning and meta-learning algorithms. We prove that for quadratic objectives, local update methods perform stochastic gradient descent on a surrogate loss function which we exactly characterize. We show that the choice of client learning rate controls the condition number of that surrogate loss, as well as the distance between the minimizers of the surrogate and true loss functions. We use this theory to derive novel convergence rates for federated averaging that showcase this trade-off between the condition number of the surrogate loss and its alignment with the true loss function. We validate our results empirically, showing that in communication-limited settings, proper learning rate tuning is often sufficient to reach near-optimal behavior. We also present a practical method…
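The trade-off described in the abstract can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: it assumes two clients with quadratic losses f_i(x) = 0.5 (x - c_i)^T A_i (x - c_i), full (non-stochastic) gradient steps, and helper names (`local_gd`, `surrogate_matrix`, `minimizer`) that are purely illustrative. For such a client, K local gradient steps with learning rate lr move x by lr * Q_i (x - c_i), where Q_i = sum_{k<K} (I - lr A_i)^k A_i, which is lr times the gradient of a surrogate quadratic with matrix Q_i.

```python
import numpy as np

def local_gd(x, A, c, lr, num_steps):
    """Run num_steps of gradient descent on f(x) = 0.5 (x - c)^T A (x - c)."""
    for _ in range(num_steps):
        x = x - lr * A @ (x - c)
    return x

def surrogate_matrix(A, lr, num_steps):
    """Return Q = sum_{k=0}^{K-1} (I - lr*A)^k A for K = num_steps."""
    d = A.shape[0]
    M = np.eye(d) - lr * A
    Q = np.zeros((d, d))
    P = np.eye(d)  # holds (I - lr*A)^k
    for _ in range(num_steps):
        Q += P @ A
        P = P @ M
    return Q

def minimizer(mats, centers):
    """Minimizer of sum_i 0.5 (x - c_i)^T B_i (x - c_i)."""
    B = sum(mats)
    b = sum(Bi @ ci for Bi, ci in zip(mats, centers))
    return np.linalg.solve(B, b)

# Hypothetical two-client problem (values chosen only for illustration).
A_list = [np.diag([1.0, 10.0]), np.diag([10.0, 1.0])]
c_list = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
K = 10
x0 = np.array([2.0, -3.0])

# Check: the displacement after K local steps equals lr * Q (x0 - c).
lr = 0.05
x_K = local_gd(x0, A_list[0], c_list[0], lr, K)
Q0 = surrogate_matrix(A_list[0], lr, K)
displacement_matches = np.allclose(x0 - x_K, lr * Q0 @ (x0 - c_list[0]))

# Sweep the client learning rate and record both sides of the trade-off.
x_true = minimizer(A_list, c_list)
conds, gaps = [], []
for lr in [0.001, 0.01, 0.1]:
    Q_list = [surrogate_matrix(A, lr, K) for A in A_list]
    conds.append(np.linalg.cond(Q_list[0]))  # per-client surrogate conditioning
    gaps.append(np.linalg.norm(minimizer(Q_list, c_list) - x_true))
    print(f"lr={lr}: cond(Q_1)={conds[-1]:.3f}, minimizer gap={gaps[-1]:.4f}")
```

In this toy setting, raising the client learning rate makes each client's surrogate matrix better conditioned while pushing the surrogate minimizer away from the true one. This only illustrates the qualitative behavior; the paper's results cover stochastic gradients and more general local update methods.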
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Domain Adaptation and Few-Shot Learning
Methods: Model-Agnostic Meta-Learning
