Adaptive Gradient Clipping for Robust Federated Learning
Youssef Allouah, Rachid Guerraoui, Nirupam Gupta, Ahmed Jellouli, Geovani Rizk, and John Stephan

TL;DR
This paper introduces Adaptive Robust Clipping (ARC), a dynamic gradient clipping method that improves robustness and convergence in federated learning, especially under adversarial attacks and data heterogeneity.
Contribution
The paper proposes ARC, an adaptive gradient clipping strategy that maintains theoretical robustness guarantees and enhances empirical performance in federated learning.
Findings
ARC improves robustness against adversarial attacks.
ARC enhances convergence in heterogeneous settings.
Experimental results confirm ARC's effectiveness on image classification tasks.
Abstract
Robust federated learning aims to maintain reliable performance despite the presence of adversarial or misbehaving workers. While state-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods were proven theoretically optimal, their empirical success has often relied on pre-aggregation gradient clipping. However, existing static clipping strategies yield inconsistent results: enhancing robustness against some attacks while being ineffective or even detrimental against others. To address this limitation, we propose a principled adaptive clipping strategy, Adaptive Robust Clipping (ARC), which dynamically adjusts clipping thresholds based on the input gradients. We prove that ARC not only preserves the theoretical robustness guarantees of SOTA Robust-DGD methods but also provably improves asymptotic convergence when the model is well-initialized. Extensive experiments…
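As a rough illustration of clipping with a threshold that depends on the input gradients, the sketch below clips the k largest-norm gradients to the norm of the (k+1)-th largest. The rule k = floor(2(f/n)(n - f)), with n the number of workers and f the tolerated number of adversarial ones, is an assumption here: this summary does not state how ARC sets its threshold. Treat the function names (`clip_to_norm`, `adaptive_robust_clip`) and the choice of k as hypothetical and check them against the paper.

```python
import numpy as np


def clip_to_norm(gradients, threshold):
    """Scale each row of `gradients` so its L2 norm is at most `threshold`."""
    gradients = np.asarray(gradients, dtype=float)
    norms = np.linalg.norm(gradients, axis=1)
    # Rows already within the threshold keep a scale factor of 1.
    scale = np.ones_like(norms)
    too_big = norms > threshold
    scale[too_big] = threshold / norms[too_big]
    return gradients * scale[:, None]


def adaptive_robust_clip(gradients, f):
    """Clip the k largest-norm gradients to the norm of the (k+1)-th largest.

    ASSUMPTION: k = floor(2 * (f / n) * (n - f)); this rule is not given in
    the summary above and is used here for illustration only.
    """
    gradients = np.asarray(gradients, dtype=float)
    n = gradients.shape[0]
    if not 0 <= f < n / 2:
        raise ValueError("f must satisfy 0 <= f < n / 2")
    # Integer arithmetic avoids floating-point surprises in the floor.
    k = (2 * f * (n - f)) // n
    norms = np.linalg.norm(gradients, axis=1)
    # Norms in descending order; index k is the (k+1)-th largest.
    threshold = np.sort(norms)[::-1][k]
    return clip_to_norm(gradients, threshold), threshold


if __name__ == "__main__":
    # Five workers, one of which sends an abnormally large gradient.
    grads = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0], [100.0, 0.0]])
    clipped, c = adaptive_robust_clip(grads, f=1)
    print("threshold:", c)
    print(clipped)
```

In this toy run the threshold follows the honest-looking gradients rather than a fixed constant, so only the outlier is rescaled. The clipped gradients would then be passed to a robust aggregator $F$, giving the composition $F\circ ARC$ that the reviews refer to.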
Peer Reviews
Decision: ICLR 2025 Spotlight
1. This paper is generally well-written.
2. The idea of adaptive clipping makes intuitive sense and performs very well empirically in the experiments of this work.
3. Byzantine resilience in distributed learning is an important and timely topic.
Although the proposed ARC strategy is generally not hard to implement and performs well empirically, there are major concerns about the theoretical analysis in this paper, which I specify point by point below.
1. The theoretical results in Section 3 show that $F\circ ARC$ is $(f,3\kappa)$-robust when $F$ is $(f,\kappa)$-robust (Theorem 3.2). Although the property of $ARC$ is much better than trivial clipping (as shown in Lemma 3.1), the convergence guarantee obtained from Theorem 3.2 fo
The paper proposes an adaptive method that maintains the robustness guarantees of the aggregators it employs while improving their practical performance, especially under high heterogeneity. The authors provide valuable insights into selecting the clipping threshold, demonstrating that a fixed threshold for all workers, commonly used in practice, may be inefficient in some cases and does not meet robust criteria. They also emphasize the gap between Byzantine theory and practical applications, hi
Considering the critical role that numerical evaluation plays in supporting the paper’s claims:
* The paper introduces an adaptive clipping approach designed to work with any robust aggregator independently of NNM. However, the numerical results primarily showcase its effectiveness only when combined with the NNM aggregator (and it is unclear if NNM was also used in Figure 6; if so, this single example may be insufficient). Since NNM has a computational complexity of $O(dn^2)$, it would be valua
The main strengths are:
- The authors propose Adaptive Robust Clipping (ARC), a new mechanism to enhance robustness in adversarial settings.
- The authors show that ARC almost retains the theoretical robustness guarantees of existing robust methods while enhancing their practical performance.
- The authors validate ARC through several experiments.
The main weaknesses are:
- ARC adds complexity to practical implementations.
- ARC's performance depends on good model initialization and may degrade when the initialization is poor. Did you run any experiments to assess this?
- While ARC improves robustness by adaptively clipping gradients, its thresholding risks clipping too aggressively in certain settings, potentially discarding useful gradient information.
Taxonomy
Topics: Machine Learning and Algorithms · Distributed Sensor Networks and Detection Algorithms · Neural Networks and Applications
Methods: Sparse Evolutionary Training
