Centroid Approximation for Byzantine-Tolerant Federated Learning
Mélanie Cambus, Darya Melnyk, Tijana Milentijević, Stefan Schmid

TL;DR
This paper investigates the robustness of federated learning against Byzantine faults, establishing bounds on centroid approximation and proposing a new algorithm with proven approximation guarantees.
Contribution
It provides the first lower bound on centroid approximation under box validity and introduces a new algorithm achieving tight approximation bounds.
Findings
Lower bound of min{(n-t)/t, sqrt(d)} on centroid approximation
New algorithm achieves a sqrt(2d)-approximation under convex validity
Bounds are achievable in distributed peer-to-peer settings
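To get a feel for the bounds listed above, the following minimal sketch evaluates them for concrete parameters. It assumes that n is the total number of clients, t the number of Byzantine clients, and d the model dimension. The function names are illustrative and do not come from the paper, and rejecting t = 0 is a choice made here to avoid dividing by zero, not a statement about the paper's assumptions.

```python
import math


def centroid_lower_bound(n: int, t: int, d: int) -> float:
    """Evaluate min{(n - t) / t, sqrt(d)} from the Findings list.

    Assumption: n = clients, t = Byzantine clients, d = dimension.
    t = 0 is rejected here only because the expression divides by t.
    """
    if t <= 0 or d <= 0 or n <= t:
        raise ValueError("expected n > t >= 1 and d >= 1")
    return min((n - t) / t, math.sqrt(d))


def convex_validity_ratio(d: int) -> float:
    """Evaluate the sqrt(2d) approximation ratio quoted for convex validity."""
    if d <= 0:
        raise ValueError("expected d >= 1")
    return math.sqrt(2 * d)


# Hypothetical parameter choices, picked only to show which term is binding.
for n, t, d in [(30, 3, 4), (30, 3, 100), (30, 10, 100)]:
    lower = centroid_lower_bound(n, t, d)
    upper = convex_validity_ratio(d)
    print(f"n={n} t={t} d={d} lower={lower:.3f} sqrt(2d)={upper:.3f}")
```

With few Byzantine clients, (n-t)/t is large and the sqrt(d) term is the binding one; when t is a constant fraction of n, (n-t)/t becomes a small constant and binds instead.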
Abstract
Federated learning allows each client to keep its data locally when training machine learning models in a distributed setting. Significant recent research established the requirements that the input must satisfy in order to guarantee convergence of the training loop. This line of work uses averaging as the aggregation rule for the training models. In particular, we are interested in whether federated learning is robust to Byzantine behavior, and observe and investigate a tradeoff between the average/centroid and the validity conditions from distributed computing. We show that the various validity conditions alone do not guarantee a good approximation of the average. Furthermore, we show that reaching good approximation does not give good results in experimental settings due to possible Byzantine outliers. Our main contribution is the first lower bound of …
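The tension the abstract describes between averaging and validity can be illustrated with a small sketch. It is not the paper's algorithm: the coordinate-wise trimmed mean below is a standard stand-in chosen here, and box validity is assumed to mean that the output lies in the smallest axis-aligned box containing the honest inputs. All vectors and names are hypothetical.

```python
from statistics import mean


def average(vectors):
    """Coordinate-wise mean of equal-length vectors (plain averaging)."""
    return [mean(coords) for coords in zip(*vectors)]


def trimmed_mean(vectors, t):
    """Coordinate-wise mean after dropping the t smallest and t largest
    values in every coordinate. Requires more than 2t input vectors."""
    if len(vectors) <= 2 * t:
        raise ValueError("need more than 2t vectors")
    result = []
    for coords in zip(*vectors):
        kept = sorted(coords)[t:len(coords) - t]
        result.append(mean(kept))
    return result


def in_box(point, vectors):
    """True if point lies in the axis-aligned bounding box of vectors."""
    return all(min(c) <= p <= max(c) for p, c in zip(point, zip(*vectors)))


# Hypothetical inputs: four honest clients and one Byzantine outlier.
honest = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.0]]
byzantine = [[1000.0, -1000.0]]
inputs = honest + byzantine

avg = average(inputs)
trimmed = trimmed_mean(inputs, t=1)

print("average:", avg, "inside honest box:", in_box(avg, honest))
print("trimmed:", trimmed, "inside honest box:", in_box(trimmed, honest))
```

Here a single outlier drags the plain average far outside the box spanned by the honest inputs, while the trimmed mean stays inside it. Staying inside the box does not by itself mean the output is close to the honest centroid, which is the gap the approximation bounds quantify.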
Peer Reviews
Decision · Submitted to ICLR 2026
+ The theoretical analysis is sound and establishes nearly tight upper and lower bounds on centroid approximation under different validity conditions, extending classical results from distributed computing to FL. + The angle of using the centroid approximation metric as a way to analyze Byzantine robustness is interesting. The paper clearly explains the geometric intuition. On the other hand, the idea of not excluding the Byzantine clients (unlike other aggregation methods) and yet achieving ro
+ The safe area assumption (t < n/(d+1)) in Definition 2.8 is quite impractical for most FL scenarios, collapsing to t=0 in realistic settings where d is large. Then, some of the results, like Lemma 3.4, although mathematically valid, have relevance limited to low-dimensional problems. + Some of the assumptions are also restrictive for many practical FL scenarios: the paper assumes synchronous communication, equal local training data sizes, or static participation. It helps to simplify the the
1. The paper provides the first lower bound of $\min \{n/t - 1, \sqrt{d} \}$ for centroid approximation under box validity and an improved upper bound of $2\min \{n , \sqrt{d} \}$ establishing nearly tight limits. 2. Bridging approximate agreement theory and federated learning offers a good perspective and strengthens theoretical rigor in Byzantine robustness. Extension to a peer-to-peer network setting is provided.
The main concern is in the simulation part. The experimental setting (30 clients) is small. It remains unclear how the proposed bounds or algorithms behave on large-scale FL systems with realistic models (e.g., CNNs, Transformers).
1. The definitions of candidate centroids, minimum covering ball, and the centroid approximation ratio are crisp and useful for analyzing Byzantine aggregation beyond worst-case distances. 2. Upper/lower bounds under weak/strong/box/convex validity are tabulated; the new analysis for box validity (including n<d) and the tight 2d bound for convex validity are valuable. 3. The extension argument to peer-to-peer (interactive consistency → identical inputs → deterministic aggregation) is straightfo
1. The setup treats all clients equally (ignores sample-size heterogeneity) “to restrict Byzantine power,” which deviates from standard sample-count weighting in FL. Please justify this modeling choice and discuss implications for deployment, including whether your bounds/algorithms still hold under non-uniform weights or can be adapted with importance weighting. 2. (Eq. 3) is stated for general n,d,t but never checks corner cases like t=0, n≤2t, n<d, or n=d+1. The statement should list validity
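The reviewers' remark that the safe-area condition t < n/(d+1) collapses to t = 0 for large d can be checked with a short calculation. The sketch below assumes only the inequality as quoted in the review; the function name and the sample values of n and d are hypothetical.

```python
def max_safe_area_faults(n: int, d: int) -> int:
    """Largest integer t satisfying t < n / (d + 1).

    The strict inequality is equivalent to t * (d + 1) < n,
    so the largest admissible t is (n - 1) // (d + 1).
    """
    if n < 1 or d < 1:
        raise ValueError("expected n >= 1 and d >= 1")
    return (n - 1) // (d + 1)


# Hypothetical settings: 30 clients and increasing model dimension.
for d in (2, 9, 29, 1000):
    print(f"n=30 d={d} max tolerated t={max_safe_area_faults(30, d)}")
```

With 30 clients, the tolerated number of faults already drops to zero once d reaches 29, far below the dimension of typical learned models.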
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Wireless Communication Security Techniques · Advanced Memory and Neural Computing
