Momentum Benefits Non-IID Federated Learning Simply and Provably
Ziheng Cheng, Xinmeng Huang, Pengfei Wu, Kun Yuan

TL;DR
This paper demonstrates that incorporating momentum into federated learning algorithms like FedAvg and SCAFFOLD improves convergence, especially under data heterogeneity and partial participation, with provable guarantees and state-of-the-art convergence rates.
Contribution
It introduces momentum-based enhancements for FedAvg and SCAFFOLD, enabling convergence without bounded heterogeneity assumptions and improving convergence rates.
Findings
Momentum allows FedAvg to converge without bounded heterogeneity.
Momentum enables SCAFFOLD to converge faster under partial participation.
New variance-reduced methods achieve state-of-the-art convergence rates.
Abstract
Federated learning is a powerful paradigm for large-scale machine learning, but it faces significant challenges due to unreliable network connections, slow communication, and substantial data heterogeneity across clients. FedAvg and SCAFFOLD are two prominent algorithms to address these challenges. In particular, FedAvg employs multiple local updates before communicating with a central server, while SCAFFOLD maintains a control variable on each client to compensate for "client drift" in its local updates. Various methods have been proposed to enhance the convergence of these two algorithms, but they either make impractical adjustments to the algorithmic structure or rely on the assumption of bounded data heterogeneity. This paper explores the utilization of momentum to enhance the performance of FedAvg and SCAFFOLD. When all clients participate in the training process, we…
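To make the mechanics concrete, here is a minimal sketch of FedAvg-style training in which each local step blends the fresh local gradient with a server-side momentum vector. It is not taken from the paper: the function name `fedavg_momentum`, the quadratic client losses, the exact blending rule, and all hyperparameter defaults are assumptions chosen for illustration. The toy problem is also too simple to exhibit client drift, so it illustrates only the update pattern, not the paper's convergence claims.

```python
import random


def fedavg_momentum(centers, rounds=200, local_steps=5, lr=0.1,
                    beta=0.2, noise=0.0, seed=0):
    """Toy FedAvg with a momentum-blended local direction (illustrative only).

    Hypothetical setup: client i holds the loss f_i(x) = 0.5 * ||x - centers[i]||^2,
    so the global optimum is the mean of the centers.
    """
    rng = random.Random(seed)
    n, dim = len(centers), len(centers[0])
    x = [0.0] * dim  # global model kept by the server
    g = [0.0] * dim  # global momentum broadcast to clients each round

    for _ in range(rounds):
        delta_sum = [0.0] * dim
        for c in centers:  # full participation: every client trains
            xi = list(x)
            for _ in range(local_steps):
                # (optionally noisy) gradient of the local quadratic loss
                grad = [xi[j] - c[j] + noise * rng.gauss(0.0, 1.0)
                        for j in range(dim)]
                # blend the local gradient with the global momentum
                d = [beta * grad[j] + (1.0 - beta) * g[j] for j in range(dim)]
                xi = [xi[j] - lr * d[j] for j in range(dim)]
            for j in range(dim):
                delta_sum[j] += x[j] - xi[j]

        # new momentum: average client displacement, rescaled to gradient units
        g = [delta_sum[j] / (lr * n * local_steps) for j in range(dim)]
        # server step; with this step size it equals averaging the client models
        x = [x[j] - lr * local_steps * g[j] for j in range(dim)]
    return x


if __name__ == "__main__":
    # three clients with heterogeneous optima; the result should approach their mean
    print(fedavg_momentum([[0.0, 0.0], [10.0, -4.0], [2.0, 1.0]]))
```

Setting `beta=1.0` removes the momentum term and recovers plain FedAvg on this toy problem, which makes the two variants easy to compare side by side.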
Peer Reviews
Decision·ICLR 2024 poster
1. This work overcomes one of the most common problems in FL analysis: data heterogeneity. Although many works in the literature analyze the convergence of the two algorithms, most rely on bounded heterogeneity assumptions. This is the most basic problem in FL analysis, and this work uses the momentum method to overcome it.
2. The experimental results are encouraging and directly validate the theory.
The major concern is novelty. FedAvg and SCAFFOLD are well-known methods in FL, and momentum is also a popular optimization technique, so the algorithm design lacks novelty. Further, some work has analyzed the performance of FedAvg with Adam updates, e.g., Reddi, Sashank, et al. "Adaptive federated optimization." arXiv preprint arXiv:2003.00295 (2020). Adam is closely related to SGD with momentum, so the proposed analysis lacks novelty.
* **SOTA CV rates:** State-of-the-art convergence rates are obtained for the introduced methods.
* **No data heterogeneity assumption**: The proof technique gets rid of the bounded data heterogeneity assumption, improving theoretical convergence rates and hinting that the method mitigates the impact of arbitrary data heterogeneity.
* **No additional uplink load**: The introduced momentum term is simple, its effect is intuitive to understand, and does not lead to any additional client to server c
* **Algorithm not new**: Contrary to what is claimed in section 3.1 (*"resulting in the new algorithm FEDAVG-M"*), the added momentum is not new: FedCM [[1]](https://arxiv.org/pdf/2106.10874.pdf) is exactly the same algorithm as FedAVG-M, although their theoretical analysis does use the bounded heterogeneity assumptions. Comparison with FedCM rates is lacking in Table 1.
* **Surprising rates for the VR variants**: [[2]](https://link.springer.com/article/10.1007/s10107-022-01822-7) states that *"
1. This paper is easy to follow.
2. The incorporation of momentum enhances the convergence rates of both FedAvg and SCAFFOLD, and this improvement is substantiated through both theoretical analysis and experimental validation.
1. The final convergence rate achieved by the authors does not sufficiently account for the impact of the momentum coefficient. Please clarify this issue.
2. In fact, FedDyn [1] demonstrates a faster convergence rate than the one the authors obtain in this paper, and it also does not need clients’ variance assumptions. This observation may highlight potential limitations in the authors' theoretical contributions.
3. The authors' work seriously lacks comparative experiments, includi
Taxonomy
Topics: Neural Networks and Applications
