Gradient Correction in Federated Learning with Adaptive Optimization
Evan Chen, Shiqiang Wang, Jianing Zhang, Dong-Jun Han, Chaoyue Liu, Christopher Brinton

TL;DR
This paper introduces FAdamGC, a novel adaptive federated learning algorithm that effectively incorporates gradient correction to handle data heterogeneity, improving convergence and efficiency over existing methods.
Contribution
The paper presents the first adaptive optimizer with integrated gradient correction for federated learning, along with a rigorous convergence analysis and empirical validation.
Findings
FAdamGC outperforms existing methods in communication and computation costs.
The algorithm achieves better convergence rates under non-convex settings.
Gradient correction improves performance in heterogeneous data environments.
Abstract
In federated learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Client-drift compensation methods have recently emerged as a solution to this issue, introducing correction terms into local model updates. To date, these methods have only been considered under stochastic gradient descent (SGD)-based model training, while modern FL frameworks also employ adaptive optimizers (e.g., Adam) for improved convergence. However, due to the complex interplay between first and second moments found in most adaptive optimization methods, naively injecting correction terms can lead to performance degradation in heterogeneous settings. In this work, we propose FAdamGC, the first algorithm to integrate drift compensation into adaptive federated optimization. The key idea of FAdamGC is injecting a pre-estimation correction term that aligns with…
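The contrast between correcting before and after moment estimation can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the SCAFFOLD-style `correction` vector (taken here as an estimate of the global gradient minus the local gradient), and the hyperparameter defaults are all assumptions made for illustration, and Adam's bias correction is omitted for brevity.

```python
import numpy as np

def adam_step_pre_correction(x, grad, correction, m, v,
                             lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical local step: the correction is added to the gradient
    BEFORE the first and second moments are accumulated."""
    g = grad + correction                 # corrected gradient feeds both moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    x = x - lr * m / (np.sqrt(v) + eps)
    return x, m, v

def adam_step_post_correction(x, grad, correction, m, v,
                              lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Naive variant: moments use the raw local gradient and the
    correction is added to the normalized step afterwards."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    x = x - lr * (m / (np.sqrt(v) + eps) + correction)
    return x, m, v

# Toy case (assumed values): the corrected gradient is exactly zero,
# as it would be at a stationary point of the global objective when the
# correction cancels the client's local gradient.
x0 = np.zeros(2)
local_grad = np.array([1.0, -2.0])
correction = -local_grad
zeros = np.zeros(2)

x_pre, _, _ = adam_step_pre_correction(x0, local_grad, correction, zeros, zeros)
x_post, _, _ = adam_step_post_correction(x0, local_grad, correction, zeros, zeros)
```

In this toy case the pre-estimation variant leaves the iterate where it is, while the naive variant still moves it, because the normalized raw-gradient step and the unnormalized correction live on different scales and do not cancel. That mismatch is one plausible reading of why naive injection can degrade performance under heterogeneity.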
Peer Reviews
Decision·Submitted to ICLR 2026
The paper provides sufficient motivation for its design choices. It justifies using Adam at the client level because FedAdam is more sensitive to gradient noise. The paper also explains why naively adapting a SCAFFOLD-like strategy for client drift mitigation in adaptive algorithms yields suboptimal results, and it carefully constructs a control variate strategy for the Adam optimizer. The paper proposes a drift compensation method for the Adam optimizer used by clients in federated learning, b
The paper noted that drift compensations should intuitively include the fixed-point structure needed for consistent convergence across different clients. However, the fixed-point structure discussed relates to a single client, since the update rule used for fixed-point problem formulation corresponds to each client's local optimizer. A discussion on how this fixed-point structure is maintained in the multi-node FL optimization problem would clarify how this intuition applies to FL, where multipl
The main contributions include (1) an algorithm with a novel gradient correction mechanism designed to stabilize updates across heterogeneous clients, (2) a theoretical convergence analysis that supports the proposed method, and (3) experimental results that demonstrate its effectiveness. The paper is clearly written and well-organized, making it easy to follow the motivation, methodology, and results. The authors present their technical novelty in a transparent and understandable manner.
That said, the literature review appears rather limited. The discussion could be strengthened by including a broader comparison with recent federated optimization approaches and by justifying the choice of Adam as the base optimizer. In modern machine learning applications, alternative optimizers such as Muon or other momentum-based methods often demonstrate superior empirical performance, so the rationale for building upon Adam should be clarified. In terms of experiments, the reported test ac
* The paper is well-written and clearly organized, making the proposed method and its analysis easy to follow.
* The experimental evaluation is comprehensive, covering both image classification and large language model (LLM) fine-tuning tasks against several relevant baselines.
* The core idea of injecting a gradient correction term before moment accumulation is intuitive and well-motivated.
* The method requires transmitting both model parameters (x) and correction terms (y) in each round, which doubles the communication payload compared to methods like LocalAdam. While the paper provides Simulated Run Time (SRT) and total volume metrics, a direct plot of accuracy versus communication bits/volume would more clearly illustrate the efficiency trade-offs.
* The paper empirically shows that β₂ > 0 is better than β₂ = 0, but an intuitive explanation for why the second-moment information
Taxonomy
Topics: Neural Networks and Applications · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques
