Byzantine-Robust Federated Learning with Learnable Aggregation Weights
Javad Parsa, Amir Hossein Daghestani, André M. H. Teixeira, Mikael Johansson

TL;DR
This paper introduces a new federated learning method that uses learnable aggregation weights to improve robustness against malicious clients, especially in heterogeneous data environments, with proven convergence and superior experimental performance.
Contribution
It proposes a novel adaptive weighting scheme for aggregation in Byzantine-robust federated learning, jointly optimized with model parameters, with strong convergence guarantees and improved robustness.
Findings
Outperforms existing Byzantine-robust FL methods in heterogeneous settings
Achieves strong convergence guarantees under adversarial attacks
Demonstrates superior robustness in diverse datasets and attack scenarios
Abstract
Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack. We analyze the Byzantine resilience of the proposed objective. We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL…
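To make the idea of learnable aggregation weights concrete, here is a minimal sketch of one possible alternating scheme on a toy problem. It is not the paper's algorithm: the quadratic client losses, the capped-simplex constraint on the weights, and the names `update_weights`, `alternating_minimization`, and `num_trusted` are all assumptions made for illustration. Under that assumed constraint, the weight step reduces to putting equal mass on the clients with the smallest reported losses.

```python
import numpy as np

def update_weights(losses, num_trusted):
    """Weight step (assumed form): minimize sum_i w_i * losses[i] over
    {w >= 0, sum(w) = 1, w_i <= 1/num_trusted}. The minimizer puts equal
    mass on the num_trusted clients with the smallest losses."""
    w = np.zeros(len(losses))
    keep = np.argsort(losses)[:num_trusted]
    w[keep] = 1.0 / num_trusted
    return w

def alternating_minimization(targets, num_trusted, steps=100, lr=0.5):
    """Toy setting: client i holds the loss f_i(x) = 0.5 * ||x - targets[i]||^2."""
    x = np.zeros(targets.shape[1])
    w = np.full(len(targets), 1.0 / len(targets))
    for _ in range(steps):
        diffs = x - targets                        # per-client gradients at x
        losses = 0.5 * np.sum(diffs ** 2, axis=1)  # per-client loss values at x
        w = update_weights(losses, num_trusted)    # weight step, model held fixed
        x = x - lr * (w @ diffs)                   # model step, weights held fixed
    return x, w

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 2))  # 8 well-behaved clients
byzantine = np.full((2, 2), 50.0)                     # 2 outlier clients
targets = np.vstack([honest, byzantine])

x, w = alternating_minimization(targets, num_trusted=8)
print("model:", x)
print("weights:", w)
```

In this toy run the two outlier clients receive zero weight and the model settles at the mean of the honest clients' targets. A real attacker can also misreport its loss value, a case the sketch does not model.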
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper studies Byzantine adversarial tolerance, which is an important problem in federated learning.
2. The paper is well written and easy to follow.
3. The theoretical analysis is rigorous and the experiments are extensive enough to support the effectiveness of the proposed algorithm.
1. Figure 1 is not clear; the colors used to denote the different algorithms are confusing.
2. How to enforce sparsity is not discussed, i.e., how is the number of malicious clients determined?
The paper presents a genuinely novel approach to Byzantine-robust federated learning by treating aggregation weights as learnable parameters. This isn't just a minor tweak to existing methods—it represents a meaningful shift in how we approach the aggregation problem. The theoretical foundation is particularly impressive, providing not just convergence guarantees but also a detailed Byzantine resilience analysis that clearly explains why the method works. What makes the contribution stand out is
While the method is compelling, it does come with some practical trade-offs. The two-round communication per update is a noticeable overhead, and while the authors argue that faster convergence might compensate for this, the paper doesn't provide a conclusive analysis of the total communication cost compared to alternatives. Some of the theoretical assumptions, like the bounded gradient deviation and heterogeneity bounds, feel somewhat idealistic—in real-world non-IID settings, these assumptions
- The method seems novel and provides a new avenue for research on Byzantine robustness.
- The algorithm intuition is clearly described, and many details are given on how to implement the algorithm in practice.
- The added computation and communication complexities are discussed in detail, which is appreciated.
- The theoretical analysis is provided for both cases: when only the sent gradients are corrupted and when the sent loss evaluations are also corrupted (even though only the la
1. It seems to me there is something conceptually wrong with the proof, precisely when decomposing the error in part E3. The bound between $F$ and $\tilde{F}$ assumes that the aggregation selects the same Byzantine clients for both the mini-batch and the full gradients. This is wrong.
2. It also seems that the proof uses the exact minimum of equation (6), whereas the algorithm provides an approximation through the first-order decomposition.
3. The theoretical results are not compared w
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
