Bridging Training and Merging Through Momentum-Aware Optimization

Alireza Moayedikia; Alicia Troncoso

arXiv:2512.17109·cs.LG·March 30, 2026

TL;DR

This paper introduces a unified framework that maintains factorized momentum and curvature statistics during training and reuses them for geometry-aware model merging, avoiding the post-hoc recomputation of parameter importance that current workflows require.

Contribution

The paper proposes keeping the curvature statistics accumulated during training and reusing them for geometry-aware model merging, so that importance estimates do not have to be recomputed after training.

Findings

01. Curvature-aware merging outperforms magnitude-only baselines across sparsity levels.

02. Multi-task merging improves performance by 1.6% over strong baselines.

03. The framework exhibits rank-invariant convergence and robustness to hyperparameters.

Abstract

Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging, wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method incurs modest memory overhead (approximately 30% over AdamW) to accumulate task saliency scores that enable curvature-aware merging. These scores, computed as a byproduct of optimization, provide importance estimates comparable to post-hoc Fisher computation while producing merge-ready models directly from training. We…
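To make the idea of reusing training-time curvature for merging concrete, here is a minimal sketch in plain Python. It is not the authors' method: it assumes a diagonal curvature proxy (an exponential moving average of squared gradients, as in Adam-style optimizers) in place of the factorized statistics the abstract mentions, and it merges with a simple saliency-weighted average. The function names `update_saliency` and `curvature_weighted_merge` and the parameter `beta` are hypothetical, chosen only for illustration.

```python
from typing import List


def update_saliency(saliency: List[float], grad: List[float],
                    beta: float = 0.99) -> List[float]:
    """One training-step update of a per-parameter saliency score.

    Assumption: saliency is an exponential moving average of squared
    gradients, a common diagonal stand-in for Fisher/curvature information.
    """
    return [beta * s + (1.0 - beta) * g * g for s, g in zip(saliency, grad)]


def curvature_weighted_merge(params: List[List[float]],
                             saliencies: List[List[float]],
                             eps: float = 1e-12) -> List[float]:
    """Merge task-specific parameter vectors, weighting each coordinate
    by the saliency its task accumulated during training.

    Where no task assigns any saliency to a coordinate, fall back to the
    plain average of the task parameters.
    """
    merged = []
    for i in range(len(params[0])):
        total = sum(s[i] for s in saliencies)
        if total <= eps:
            merged.append(sum(p[i] for p in params) / len(params))
        else:
            weighted = sum(s[i] * p[i] for p, s in zip(params, saliencies))
            merged.append(weighted / total)
    return merged


# Two toy "task models" with two parameters each (illustrative values only).
task_a, sal_a = [1.0, 0.0], [1.0, 0.0]
task_b, sal_b = [3.0, 4.0], [1.0, 2.0]
merged = curvature_weighted_merge([task_a, task_b], [sal_a, sal_b])
print(merged)  # [2.0, 4.0]
```

In the toy example, the first coordinate is equally salient for both tasks, so it is averaged; the second matters only to the second task, so that task's value is kept. A magnitude-only rule has no access to this distinction, which is the intuition behind the curvature-aware comparison in the findings.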
