FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning
Peishen Yan, Yang Hua, Hao Wang, Jiaru Zhang, Xiaoyu Wu, Tao Song, Haibing Guan

TL;DR
FedMomentum introduces an SVD-based aggregation method for federated LoRA fine-tuning of LLMs, effectively preserving training momentum, improving convergence speed, and enhancing task-specific performance.
Contribution
The paper proposes FedMomentum, a novel structured aggregation framework that maintains LoRA training momentum using SVD, addressing structural and convergence issues in federated fine-tuning.
Findings
FedMomentum outperforms existing methods in convergence speed.
It achieves higher final accuracy across multiple tasks.
The approach effectively preserves LoRA's low-rank structure.
Abstract
Federated fine-tuning of large language models (LLMs) with low-rank adaptation (LoRA) offers a communication-efficient and privacy-preserving solution for task-specific adaptation. Naive aggregation of LoRA modules introduces noise due to mathematical incorrectness when averaging the downsampling and upsampling matrices independently. However, existing noise-free aggregation strategies inevitably compromise the structural expressiveness of LoRA, limiting its ability to retain client-specific adaptations by either improperly reconstructing the low-rank structure or excluding partially trainable components. We identify this problem as loss of training momentum, where LoRA updates fail to accumulate effectively across rounds, resulting in slower convergence and suboptimal performance. To address this, we propose FedMomentum, a novel framework that enables structured and momentum-preserving…
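The aggregation noise described in the abstract can be checked numerically. The following is a minimal sketch, not the paper's code: the matrix sizes, client count, and random factors are hypothetical, and equal client weights are assumed. It compares the exact mean of the per-client products B_i A_i with the product of the independently averaged factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_clients = 16, 12, 4, 3  # hypothetical sizes, chosen for illustration

# Per-client LoRA factors: B_i (d x r, upsampling) and A_i (r x k, downsampling).
Bs = [rng.standard_normal((d, r)) for _ in range(n_clients)]
As = [rng.standard_normal((r, k)) for _ in range(n_clients)]

# Exact aggregate: the mean of the per-client weight updates B_i @ A_i.
exact = sum(B @ A for B, A in zip(Bs, As)) / n_clients

# Naive aggregate: average B and A independently, then multiply.
naive = (sum(Bs) / n_clients) @ (sum(As) / n_clients)

# The gap between the two is the aggregation noise.
noise = np.linalg.norm(exact - naive)

# The exact aggregate generally has rank above r, so it no longer fits
# a single rank-r LoRA module without some form of re-factorization.
exact_rank = np.linalg.matrix_rank(exact)

print(f"aggregation noise (Frobenius norm): {noise:.3f}")
print(f"rank of exact aggregate: {exact_rank}, LoRA rank: {r}")
```

The naive product keeps rank at most r but differs from the true average update, while the exact average is noise-free but exceeds rank r. This is the tension that noise-free aggregation strategies have to resolve.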
Peer Reviews
Decision · Submitted to ICLR 2026
Clear problem framing around momentum loss in federated LoRA and a principled SVD-based fix. Consistent improvements and faster convergence over strong baselines, with thorough ablations. Practicality considered via randomized SVD and reporting of runtime overhead.
“Momentum” story isn’t operationalized: The paper motivates momentum preservation with spectra/visuals but never measures optimization continuity directly (e.g., gradient-direction alignment across rounds, cosine to prior updates, curvature drift). Without such probes, the claimed mechanism remains a narrative rather than an evidenced cause. Residuals add hidden communication/state costs: FedMomentum keeps a residual subspace (until ~99% energy is retained) and ships it back for backbone merg
1. Consistent improvements on math, commonsense, and code benchmarks. 2. Splitting Σ as Σ^{1/2} across B and A is a low-cost fix for singular-value skew.
1. All results are on LLaMA2-7B; behavior on newer or larger models is unknown. 2. Experiments use 10 clients with Dirichlet β=0.5. It’s unclear how momentum preservation holds under more extreme non-IID, partial participation, or straggler scenarios.
1. The experiments show consistent gains across multiple datasets.
1. FlexLoRA [1], which also uses SVD-based aggregation, is mentioned in the related work but should be thoroughly discussed, compared against, and included as an important baseline. The main incremental contribution of FedMomentum is its insights, but the method is highly similar to the closest work, FlexLoRA, and has not been fully compared with it; in the absence of rigorous positioning and systematic ablation, the reported performance improvement is difficult to attribute clearly. 2. It is not clear how t
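The reviews refer to splitting the square root of Σ across B and A and to tracking the energy retained by the kept subspace. The following is a minimal sketch of that idea under stated assumptions: the function name `refactor_lora`, the matrix sizes, and the use of a full (not randomized) SVD are illustrative choices, and the paper's actual residual handling is not reproduced here.

```python
import numpy as np

def refactor_lora(delta_w, r):
    """Re-factorize an aggregated update into rank-r factors B and A.

    Hypothetical helper: keeps the top-r singular directions and gives
    each factor the square root of the singular values.
    """
    U, s, Vt = np.linalg.svd(delta_w, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    B = U[:, :r] * sqrt_s            # scale each kept column of U
    A = sqrt_s[:, None] * Vt[:r]     # scale each kept row of V^T
    residual = delta_w - B @ A       # part of the update outside the top-r subspace
    energy = float(np.sum(s[:r] ** 2) / np.sum(s ** 2))
    return B, A, residual, energy

rng = np.random.default_rng(0)
d, k, r, n_clients = 16, 12, 4, 3  # hypothetical sizes

# Exact average of per-client updates B_i @ A_i (equal weights assumed).
delta_w = sum(
    rng.standard_normal((d, r)) @ rng.standard_normal((r, k))
    for _ in range(n_clients)
) / n_clients

B, A, residual, energy = refactor_lora(delta_w, r)
print(f"retained spectral energy at rank {r}: {energy:.3f}")
```

With this split, the i-th column of B and the i-th row of A both have norm equal to the square root of the i-th singular value, so neither factor carries the whole singular-value scale on its own.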
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Privacy-Preserving Technologies in Data
