Stragglers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning
Yujia Wang, Fenglong Ma, Jinghui Chen

TL;DR
FedEcho introduces an uncertainty-aware distillation method in asynchronous federated learning, effectively balancing the contributions of straggler and faster clients to improve model performance amid delays and data heterogeneity.
Contribution
The paper presents FedEcho, a novel framework that dynamically assesses and adjusts client prediction reliability to enhance asynchronous federated learning.
Findings
FedEcho outperforms existing baselines across vision, NLP, and generative language model tasks.
It mitigates both the impact of outdated updates from stragglers and the bias introduced by faster clients.
The method maintains robust performance without accessing private data.
Abstract
Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace without waiting for slower participants. However, such a design encounters significant challenges, such as the risk of outdated updates from straggler clients degrading the overall model performance and the potential bias introduced by faster clients dominating the learning process, especially under heterogeneous data distributions. Existing methods typically address only one of these issues, creating a conflict where mitigating the impact of outdated updates can exacerbate the bias created by faster clients, and vice versa. To address these challenges, we propose FedEcho, a novel framework that incorporates uncertainty-aware distillation to enhance the asynchronous FL performances under large…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper tackles a relevant problem in federated learning: with many stragglers, waiting for them can become a communication bottleneck, and their stale updates can hurt the global model's performance.
- The paper includes convergence guarantees showing that the method's theoretical convergence aligns with previous work in asynchronous FL.
- The paper is clearly written and easy to understand, with additional ablation studies of FedEcho on the different hy
- The novelty is rather limited: the main contribution is the distillation loss in (3), whose mixing weight $\alpha$ balances hard labels against soft ones. Such an approach has already been proposed in many other works [1-2], although not directly in the context of asynchronous FL. Moreover, (3) alone makes for a weak argument that uncertainty is being leveraged, especially since tuning $\alpha$ appears to be tricky.
1. The use of uncertainty-aware distillation to extract knowledge from stragglers without direct parameter mixing is an elegant and effective solution to the core staleness-vs-bias problem in asynchronous FL.
2. The method is convincingly validated against strong baselines across diverse tasks, including vision, NLP, and generative language models, demonstrating robust and significant performance gains.
1. Unclear Algorithmic Rationale and Complexity: The algorithm presented is confusing because it seems to perform two separate update steps. In Line 11, the global model is updated via standard parameter averaging ($\hat{x}_{t+1} = x_t + \eta \Delta_t$), which directly incorporates stale updates, the very problem the paper aims to avoid. Then, in a second phase (Lines 12-16), this newly updated model is further refined via distillation. The paper does not adequately justify why
The paper is well written, with clear motivation and a clear description of the proposal. The proposed entropy-based weight adaptation between KL and CE is reasonable and interesting. Though there are entropy-based works in the field, one of which is also mentioned in the related work (Itahara et al., 2021), the proposal is still somewhat novel. The authors provide both theoretical analysis and empirical tests to show the performance boundaries.
The choice of the minimum and maximum values of $\alpha$ lacks discussion. The intuition linking entropy to $\alpha$ is reasonable, but it lacks deeper and wider discussion and exploration. The limitations of the work are not discussed.
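The reviews above refer to a distillation loss that mixes a hard-label cross-entropy (CE) term with a soft-label KL term through a weight $\alpha$, where $\alpha$ is adapted from prediction entropy and bounded by minimum and maximum values. The paper's equation (3) is not reproduced on this page, so the following is only a minimal sketch of that general idea, not the authors' formulation. The function names, the linear entropy-to-$\alpha$ mapping, its direction (higher teacher entropy puts more weight on the hard label), and the default bounds 0.1 and 0.9 are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def normalized_entropy(log_probs):
    """Shannon entropy divided by log(num_classes), so it lies in [0, 1]."""
    h = -sum(math.exp(lp) * lp for lp in log_probs)
    return h / math.log(len(log_probs))


def mixing_weight(teacher_log_probs, alpha_min=0.1, alpha_max=0.9):
    """Assumed rule: map teacher uncertainty linearly into [alpha_min, alpha_max].

    A more uncertain teacher yields a larger alpha, i.e. more weight on the
    hard-label term. The paper may use a different mapping.
    """
    u = normalized_entropy(teacher_log_probs)
    return alpha_min + (alpha_max - alpha_min) * u


def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha_min=0.1, alpha_max=0.9):
    """Return (loss, alpha) with loss = alpha * CE + (1 - alpha) * KL(teacher || student).

    hard_label is a class index; where it comes from (for example a pseudo-label)
    is left open here and is not specified by this page.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    alpha = mixing_weight(log_t, alpha_min, alpha_max)
    ce = -log_s[hard_label]
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return alpha * ce + (1.0 - alpha) * kl, alpha


if __name__ == "__main__":
    # A confident teacher (low entropy) pushes alpha toward alpha_min,
    # so the soft-label KL term dominates.
    loss, alpha = distillation_loss([1.0, 0.5, -0.2], [4.0, 0.0, 0.0], hard_label=0)
    print(f"alpha={alpha:.3f} loss={loss:.4f}")
```

Under these assumptions, the bounds `alpha_min` and `alpha_max` fix how far a single sample's weighting can swing between the two terms, which is why the reviewers' request for guidance on choosing them matters in practice.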
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Caching and Content Delivery · Stochastic Gradient Optimization Techniques
