Bridging Training and Merging Through Momentum-Aware Optimization
Alireza Moayedikia, Alicia Troncoso

TL;DR
This paper introduces a unified framework that keeps the momentum and curvature statistics accumulated during training and reuses them to merge task-specific models, so parameter importance does not have to be recomputed after training.
Contribution
It proposes maintaining factorized momentum and curvature statistics during training and reusing them as task saliency scores for geometry-aware model merging, at a memory overhead of approximately 30% over AdamW and without a separate post-hoc Fisher computation.
Findings
Curvature-aware merging outperforms magnitude-only baselines across sparsity levels.
Multi-task merging improves performance by 1.6% over strong baselines.
The framework exhibits rank-invariant convergence and robustness to hyperparameters.
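To make the first finding concrete, here is a minimal sketch contrasting magnitude-only and curvature-aware sparsification of a task vector (fine-tuned weights minus base weights). It is not the paper's implementation: the function names, the saliency form `curvature * delta**2`, and the top-k selection rule are all assumptions chosen for illustration.

```python
import numpy as np

def topk_mask(scores, sparsity):
    """Boolean mask keeping the (1 - sparsity) fraction of entries with the highest score."""
    flat = scores.ravel()
    k = int(round((1.0 - sparsity) * flat.size))  # number of entries to keep
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # argpartition places the k largest scores in the last k positions
        keep = np.argpartition(flat, flat.size - k)[flat.size - k:]
        mask[keep] = True
    return mask.reshape(scores.shape)

def sparsify_task_vector(delta, curvature=None, sparsity=0.9):
    """Zero out low-importance entries of a task vector.

    curvature=None  -> magnitude-only score |delta|
    curvature given -> assumed saliency score curvature * delta**2
    """
    scores = np.abs(delta) if curvature is None else curvature * delta**2
    return delta * topk_mask(scores, sparsity)

# Hypothetical toy values: small updates in high-curvature directions
# survive under the curvature-aware score but not under magnitude alone.
delta = np.array([1.0, -0.5, 0.1, 2.0])
curvature = np.array([0.01, 10.0, 100.0, 0.01])
by_magnitude = sparsify_task_vector(delta, sparsity=0.5)
by_curvature = sparsify_task_vector(delta, curvature, sparsity=0.5)
```

With these toy values the two criteria keep disjoint sets of parameters, which is the kind of difference the sparsity-level comparison measures.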
Abstract
Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging--wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method incurs modest memory overhead (approximately 30% over AdamW) to accumulate task saliency scores that enable curvature-aware merging. These scores, computed as a byproduct of optimization, provide importance estimates comparable to post-hoc Fisher computation while producing merge-ready models directly from training. We…
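The workflow the abstract describes, accumulating importance scores as a byproduct of optimization and reusing them at merge time, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it uses an unfactorized, Adam-style running average of squared gradients as a stand-in for the curvature statistic, and a saliency-weighted average of task vectors as the merge rule. `SaliencyTracker` and `merge_with_saliency` are hypothetical names.

```python
import numpy as np

class SaliencyTracker:
    """Accumulates a per-parameter saliency score during training.

    Assumption: an exponential moving average of squared gradients (a diagonal
    Fisher-style proxy), kept unfactorized for simplicity.
    """

    def __init__(self, shape, beta2=0.999):
        self.v = np.zeros(shape)
        self.beta2 = beta2
        self.steps = 0

    def update(self, grad):
        # Call once per optimizer step with that step's gradient.
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad**2
        self.steps += 1

    def saliency(self):
        # Bias-corrected average, as in Adam's second-moment estimate.
        if self.steps == 0:
            return self.v.copy()
        return self.v / (1.0 - self.beta2**self.steps)

def merge_with_saliency(base, finetuned, saliencies, eps=1e-12):
    """Merge task-specific weights, weighting each task vector by its saliency."""
    deltas = [w - base for w in finetuned]
    weighted = sum(s * d for s, d in zip(saliencies, deltas))
    total = sum(saliencies) + eps  # eps avoids division by zero where no task is salient
    return base + weighted / total

# Hypothetical usage with two tasks that each matter for a different parameter.
base = np.zeros(2)
task_a, task_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
sal_a, sal_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
merged = merge_with_saliency(base, [task_a, task_b], [sal_a, sal_b])
```

Because the tracker is updated inside the training loop, the saliency scores are available as soon as training ends, with no extra pass over the data to estimate importance.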
