Byzantine-Robust Federated Learning with Learnable Aggregation Weights

Javad Parsa; Amir Hossein Daghestani; André M. H. Teixeira; Mikael Johansson

arXiv:2511.03529·cs.LG·November 6, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new federated learning method that uses learnable aggregation weights to improve robustness against malicious clients, especially in heterogeneous data environments, with proven convergence and superior experimental performance.

Contribution

It proposes a novel adaptive weighting scheme for aggregation in Byzantine-robust federated learning, jointly optimized with model parameters, with strong convergence guarantees and improved robustness.

Findings

01 Outperforms existing Byzantine-robust FL methods in heterogeneous settings

02 Achieves strong convergence guarantees under adversarial attacks

03 Demonstrates superior robustness in diverse datasets and attack scenarios

Abstract

Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack. We analyze the Byzantine resilience of the proposed objective. We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL…
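
The idea of learning aggregation weights jointly with the model can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' algorithm: each client holds a quadratic loss, the weights are constrained to a capped simplex, and the two blocks of variables are updated in alternation. The function names, the quadratic losses, the capped-simplex constraint, and the parameter `num_trusted` are all hypothetical choices made for illustration. The sketch also assumes that clients report their loss values truthfully, so it does not model corrupted loss reports.

```python
import numpy as np


def client_loss(x, center):
    # Hypothetical local objective: f_i(x) = 0.5 * ||x - c_i||^2.
    return 0.5 * float(np.sum((x - center) ** 2))


def update_weights(losses, num_trusted):
    # Weight step (assumed constraint set): minimize sum_i w_i * losses[i]
    # over {w >= 0, sum(w) = 1, w_i <= 1 / num_trusted}. The exact minimizer
    # puts equal mass on the num_trusted clients with the smallest losses.
    weights = np.zeros(len(losses))
    trusted = np.argsort(losses)[:num_trusted]
    weights[trusted] = 1.0 / num_trusted
    return weights


def alternating_minimization(centers, num_trusted, steps=200, lr=0.5):
    # Start from the coordinate-wise median so the first weight step
    # is not dominated by outlying clients.
    x = np.median(centers, axis=0)
    weights = np.full(len(centers), 1.0 / len(centers))
    for _ in range(steps):
        # Step 1: update the aggregation weights with the model fixed.
        losses = np.array([client_loss(x, c) for c in centers])
        weights = update_weights(losses, num_trusted)
        # Step 2: gradient step on the weighted objective with weights fixed.
        grads = x - centers  # row i is the gradient of f_i at x
        x = x - lr * (weights @ grads)
    return x, weights


# Toy data: 7 honest clients near 1.0 and 3 outlying clients at 50.0.
rng = np.random.default_rng(0)
honest = 1.0 + 0.1 * rng.standard_normal((7, 3))
outliers = np.full((3, 3), 50.0)
centers = np.vstack([honest, outliers])

x_hat, weights = alternating_minimization(centers, num_trusted=7)
print("estimate:", x_hat)
print("weights:", weights)
```

In this toy setting the weight step drives the outlying clients' weights to zero, and the model step then converges to the mean of the remaining clients. Note that `num_trusted` must be chosen in advance here, which is the kind of question Reviewer 01 raises about how the number of malicious clients is determined.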

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The paper studies Byzantine adversarial tolerance, which is an important problem in federated learning.
2. The paper is well written and easy to follow.
3. The theoretical analysis is rigorous and the experiments are extensive to support the effectiveness of the proposed algorithm.

Weaknesses

1. Figure 1 is unclear: the colors used to denote the different algorithms are confusing.
2. How sparsity is enforced is not discussed, i.e., how is the number of malicious clients determined?

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The paper presents a genuinely novel approach to Byzantine-robust federated learning by treating aggregation weights as learnable parameters. This isn't just a minor tweak to existing methods—it represents a meaningful shift in how we approach the aggregation problem. The theoretical foundation is particularly impressive, providing not just convergence guarantees but also a detailed Byzantine resilience analysis that clearly explains why the method works. What makes the contribution stand out is

Weaknesses

While the method is compelling, it does come with some practical trade-offs. The two-round communication per update is a noticeable overhead, and while the authors argue that faster convergence might compensate for this, the paper doesn't provide a conclusive analysis of the total communication cost compared to alternatives. Some of the theoretical assumptions, like the bounded gradient deviation and heterogeneity bounds, feel somewhat idealistic—in real-world non-IID settings, these assumptions

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The method seems novel and provides a new avenue for research on Byzantine robustness.
- The algorithm intuition is clearly described, and many details are given on how to implement the algorithm in practice.
- The added computation and communication complexities are discussed in detail, which is appreciated.
- The theoretical analysis is provided for both the case when only the sent gradients are corrupted and the case when the sent loss evaluations are also corrupted (even though only the la

Weaknesses

1. It seems to me there is something conceptually wrong with the proof, precisely when decomposing the error in part E3. The bound between $F$ and $\tilde{F}$ assumes that the same Byzantine clients will be selected by the aggregation for either the mini-batch or full gradients. This is wrong.
2. It also seems that the proof uses the exact minimum of equation (6), whereas the algorithm provides an approximation through the first-order decomposition.
3. The theoretical results are not compared w

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning