Gradient Correction in Federated Learning with Adaptive Optimization

Evan Chen; Shiqiang Wang; Jianing Zhang; Dong-Jun Han; Chaoyue Liu; Christopher Brinton

arXiv:2502.02727 · cs.LG · May 20, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces FAdamGC, a novel adaptive federated learning algorithm that effectively incorporates gradient correction to handle data heterogeneity, improving convergence and efficiency over existing methods.

Contribution

The paper presents the first adaptive optimizer with integrated gradient correction for federated learning, along with a rigorous convergence analysis and empirical validation.

Findings

01. FAdamGC outperforms existing methods in communication and computation costs.

02. The algorithm achieves better convergence rates under non-convex settings.

03. Gradient correction improves performance in heterogeneous data environments.

Abstract

In federated learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Client-drift compensation methods have recently emerged as a solution to this issue, introducing correction terms into local model updates. To date, these methods have only been considered under stochastic gradient descent (SGD)-based model training, while modern FL frameworks also employ adaptive optimizers (e.g., Adam) for improved convergence. However, due to the complex interplay between first and second moments found in most adaptive optimization methods, naively injecting correction terms can lead to performance degradation in heterogeneous settings. In this work, we propose FAdamGC, the first algorithm to integrate drift compensation into adaptive federated optimization. The key idea of FAdamGC is injecting a pre-estimation correction term that aligns with…
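To make the placement of the correction term concrete, here is a minimal scalar sketch that contrasts two options: adding the correction to the gradient before Adam's moments are accumulated, and applying it after an Adam step on the raw local gradient. It is not the FAdamGC update, whose details the truncated abstract does not give. The function names, the two-client quadratic problem, the SCAFFOLD-style correction (global gradient minus local gradient), and the reading of "naive" as post-step correction are all assumptions made for illustration.

```python
import math


def adam_step(x, g, m, v, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam-style step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    return x - lr * m / (math.sqrt(v) + eps), m, v


def step_pre_correction(x, local_grad, correction, m, v):
    """Assumed 'pre-estimation' placement: correct the gradient, then accumulate moments."""
    return adam_step(x, local_grad + correction, m, v)


def step_post_correction(x, local_grad, correction, m, v, lr=0.1):
    """Assumed naive placement: Adam on the raw gradient, correction added afterwards."""
    x, m, v = adam_step(x, local_grad, m, v, lr=lr)
    return x - lr * correction, m, v


# Hypothetical toy problem: two clients with losses 0.5 * (x - a_i)^2.
targets = [-1.0, 3.0]
x_star = sum(targets) / len(targets)                  # global optimum
local_grad = x_star - targets[0]                      # client 0 gradient at x_star
global_grad = sum(x_star - a for a in targets) / 2    # zero at the optimum
correction = global_grad - local_grad                 # SCAFFOLD-style control variate

x_pre, _, _ = step_pre_correction(x_star, local_grad, correction, m=0.0, v=0.0)
x_post, _, _ = step_post_correction(x_star, local_grad, correction, m=0.0, v=0.0)
print(x_pre, x_post)
```

In this toy setting the corrected gradient is exactly zero at the global optimum, so the pre-correction step leaves the iterate where it is. The post-correction step moves away from the optimum, because Adam normalizes the raw local gradient by its own second moment before the correction is applied and the two terms no longer cancel.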

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper motivates each of its design choices. It justifies using Adam at the client level on the grounds that FedAdam is more sensitive to gradient noise. The paper also explains why naively adapting a SCAFFOLD-like strategy for client drift mitigation in adaptive algorithms yields suboptimal results, and it carefully constructs a control variate strategy for the Adam optimizer. The paper proposes a drift compensation method for the Adam optimizer used by clients in federated learning, b

Weaknesses

The paper notes that drift compensation should intuitively preserve the fixed-point structure needed for consistent convergence across different clients. However, the fixed-point structure discussed relates to a single client, since the update rule used in the fixed-point formulation corresponds to each client's local optimizer. A discussion on how this fixed-point structure is maintained in the multi-node FL optimization problem would clarify how this intuition applies to FL, where multipl

Reviewer 02 · Rating 6 · Confidence 2

Strengths

The main contributions include (1) an algorithm with a novel gradient correction mechanism designed to stabilize updates across heterogeneous clients, (2) a theoretical convergence analysis that supports the proposed method, and (3) experimental results that demonstrate its effectiveness. The paper is clearly written and well-organized, making it easy to follow the motivation, methodology, and results. The authors present their technical novelty in a transparent and understandable manner.

Weaknesses

That said, the literature review appears rather limited. The discussion could be strengthened by including a broader comparison with recent federated optimization approaches and by justifying the choice of Adam as the base optimizer. In modern machine learning applications, alternative optimizers such as Muon or other momentum-based methods often demonstrate superior empirical performance, so the rationale for building upon Adam should be clarified. In terms of experiments, the reported test ac

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The paper is well-written and clearly organized, making the proposed method and its analysis easy to follow.
* The experimental evaluation is comprehensive, covering both image classification and large language model (LLM) fine-tuning tasks against several relevant baselines.
* The core idea of injecting a gradient correction term before moment accumulation is intuitive and well-motivated.

Weaknesses

* The method requires transmitting both model parameters (x) and correction terms (y) in each round, which doubles the communication payload compared to methods like LocalAdam. While the paper provides Simulated Run Time (SRT) and total volume metrics, a direct plot of accuracy versus communication bits/volume would more clearly illustrate the efficiency trade-offs.
* The paper empirically shows that β₂ > 0 is better than β₂ = 0, but an intuitive explanation for why the second-moment information
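The payload doubling raised in the first point above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes the correction term y has the same dimension as the model x, that both are sent as dense float32 vectors without compression, and it uses a hypothetical model size. The helper name `payload_bytes` is invented for this example.

```python
def payload_bytes(num_params, vectors_per_round, bytes_per_param=4):
    """Bytes sent in one direction per round, assuming dense float32 vectors."""
    return num_params * vectors_per_round * bytes_per_param


num_params = 1_000_000  # hypothetical model size

params_only = payload_bytes(num_params, vectors_per_round=1)            # x only
params_and_correction = payload_bytes(num_params, vectors_per_round=2)  # x and y

ratio = params_and_correction / params_only
print(params_only, params_and_correction, ratio)
```

Under these assumptions the per-round ratio is fixed at two regardless of model size, so any net saving in total volume has to come from needing fewer rounds. That is what an accuracy-versus-communication plot would show directly.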


Taxonomy

Topics: Neural Networks and Applications · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques