TL;DR
Fed-SB introduces a communication-efficient federated fine-tuning method for large language models using LoRA-SB, achieving state-of-the-art performance with significantly reduced communication costs and enhanced privacy in federated settings.
Contribution
The paper proposes Fed-SB, a novel federated fine-tuning approach utilizing LoRA-SB that reduces communication costs and improves performance in federated learning of language models.
Findings
Achieves state-of-the-art results on multiple NLP tasks.
Reduces communication costs by up to 230 times.
Enhances privacy by lowering noise requirements for differential privacy.
Abstract
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which…
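The abstract's claim that averaging R gives exact updates can be checked numerically. The sketch below is illustrative only and is not the authors' code. It assumes the shared frozen adapters B and A are random orthonormal matrices, which stands in for whatever initialization the paper actually uses, and it draws random matrices in place of locally trained client weights. The dimensions `d`, `k`, `r` and the client count are hypothetical. It compares averaging R against naively averaging per-client B and A in ordinary LoRA, and counts the parameters each scheme would send per adapted layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, num_clients = 64, 48, 4, 5  # hypothetical sizes, not from the paper

# Shared, frozen adapters (assumption: random orthonormal columns/rows).
B = np.linalg.qr(rng.standard_normal((d, r)))[0]    # shape (d, r)
A = np.linalg.qr(rng.standard_normal((k, r)))[0].T  # shape (r, k)

# Stand-ins for each client's locally trained r x r matrix R_i.
client_R = [rng.standard_normal((r, r)) for _ in range(num_clients)]

# Ideal aggregate: the mean of the full client updates B R_i A.
ideal_update = np.mean([B @ R @ A for R in client_R], axis=0)

# Averaging only R: B and A are shared, so the product is linear in R.
R_avg = np.mean(client_R, axis=0)
r_avg_update = B @ R_avg @ A
r_avg_error = np.linalg.norm(r_avg_update - ideal_update)

# Ordinary LoRA with per-client B_i and A_i, averaged separately.
client_B = [rng.standard_normal((d, r)) for _ in range(num_clients)]
client_A = [rng.standard_normal((r, k)) for _ in range(num_clients)]
ideal_lora = np.mean([Bi @ Ai for Bi, Ai in zip(client_B, client_A)], axis=0)
naive_lora = np.mean(client_B, axis=0) @ np.mean(client_A, axis=0)
lora_error = np.linalg.norm(naive_lora - ideal_lora)

# Parameters sent per client, per round, per adapted layer.
params_R = r * r           # only the square matrix R
params_lora = r * (d + k)  # both B (d x r) and A (r x k)

print(f"error when averaging R:       {r_avg_error:.2e}")
print(f"error when averaging B and A: {lora_error:.2e}")
print(f"parameters sent: R = {params_R}, LoRA = {params_lora}")
```

The first error is zero up to floating-point rounding because the update B R A is linear in R when B and A are fixed and shared. The second is not, because the mean of products B_i A_i differs from the product of the means. The per-layer payload for R also depends only on the rank, not on the layer dimensions.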
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well written and well motivated in general.
- Seeking to improve the communication efficiency of federated LLM fine-tuning seems to be an interesting research direction.
- The proposed Fed-SB method is intuitive and easy to follow.
- The experimental results of the Fed-SB method seem to be promising.
- The contribution of the proposed Fed-SB method seems to be marginal. I find it hard to identify significant algorithmic innovation, except for extending the LoRA-SB method to a federated setting.
- The base model used in the experimental study seems to be somewhat outdated.
- I am not sure if federated LLM fine-tuning is a practical scenario, as centralized fine-tuning appears to dominate. I also failed to find any real-world adoption of federated LLM fine-tuning.
1. The paper is well-written and easy to understand.
2. The introduced method achieves exact updates for FL with LoRA fine-tuning.
1. The novelty of this work is limited. The proposed Fed-SB is built upon LoRA-SB [1], which is not an original contribution of this study. Moreover, such strategies have been applied in [2].
2. The initialization of A and B can significantly impact performance, yet the paper offers no analysis of this, as it directly initializes A and B as orthonormal matrices. Exploring different initializations, such as random initialization or performing SVD on W_0 and using the decomposed values
1. The method is simple and effective.
2. The experiments show consistent gains across multiple tasks and multiple LLMs.
1. Despite strong experimental results, the submission lacks the core justification (theoretical or isolating experiments) that would explain why the method works and under what conditions it should be expected to work.
2. Lack of key ablation studies: (1) extreme low-rank regimes (rank-1, rank-2, …), to identify the expressivity threshold below which Fed-SB ceases to be effective; (2) different initializations of LoRA-B and LoRA-A, since LoRA-A and LoRA-B are frozen, initialization may be the dom
S1: The core idea is intuitive and easy to follow.
S2: The method demonstrates strong empirical performance across various benchmarks, while using drastically fewer communication parameters per round.
W1: The paper is unclear about the implementation details of the initialization phase.
W2: Model performance appears highly sensitive to the initial subspace quality, risking a severe performance cap if the approximation phase uses limited or unrepresentative data.
W3: The constrained update space (only matrix R is trained) raises concerns about overfitting to the initial subspace, potentially limiting generalization.
