TL;DR
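FedAdamW is a federated variant of the AdamW optimizer for training and fine-tuning large models. It targets three problems that arise when AdamW runs locally on clients: high variance in the second-moment estimate under heterogeneous data, local overfitting that causes client drift, and slow convergence from reinitializing moment estimates every round. The method comes with convergence guarantees.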
Contribution
The paper introduces FedAdamW, the first federated AdamW variant with variance reduction, local correction, and convergence guarantees, tailored for large models.
Findings
Reduces communication rounds significantly.
Improves test accuracy over baselines.
Validates effectiveness on language and vision models.
Abstract
AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate v; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing moment estimates (v, m) at each round slows down convergence. To address these challenges, we propose the first Federated AdamW algorithm, called FedAdamW, for training and fine-tuning various large models. FedAdamW aligns local updates with the global update using both a local correction mechanism and decoupled weight decay to mitigate local overfitting.…
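For readers less familiar with the quantities the abstract refers to, here is a minimal sketch of one step of standard AdamW, not of FedAdamW itself. It shows the first-moment estimate m, the second-moment estimate v, and the decoupled weight decay term. The function name `adamw_step` and the default hyperparameters are illustrative assumptions, not taken from the paper.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One standard AdamW step on lists of floats (illustrative helper).

    w: parameters, g: gradients, m/v: first/second moment estimates,
    t: 1-based step count used for bias correction.
    Returns the updated (w, m, v) as new lists.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        mi = beta1 * mi + (1 - beta1) * gi
        vi = beta2 * vi + (1 - beta2) * gi * gi
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        # Decoupled weight decay: the decay acts on the weight directly
        # and is not scaled by the adaptive denominator.
        wi = wi - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * wi)
        new_w.append(wi)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

# A client that starts a round with fresh state would pass zeros for m and v.
w, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1)
```

In a federated setting, each client would run steps like this on its own data. If m and v are reset to zero at the start of every round, as challenge (3) describes, the statistics accumulated in earlier rounds are discarded.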
