Model soups need only one ingredient
Alireza Abdollahpoorrostam, Nikolaos Dimitriadis, Adam Hazimeh, Pascal Frossard

TL;DR
MonoSoup is a simple, data-free, post-hoc method that improves the out-of-distribution robustness of fine-tuned models using only a single checkpoint, by decomposing and reweighting layer updates based on spectral analysis.
Contribution
Introduces MonoSoup, a novel single-checkpoint, spectral decomposition-based method that balances in-distribution accuracy and out-of-distribution robustness without additional training.
Findings
MonoSoup achieves comparable ID-OOD performance to multi-checkpoint methods.
It effectively reweights layer updates using spectral information.
The method is practical and computationally efficient.
Abstract
Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses…
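The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The cumulative-energy cutoff `energy`, the function names, and the mixing weights `lam_high` and `lam_low` are all hypothetical, since the abstract is truncated before it states how MonoSoup actually sets its coefficients.

```python
import numpy as np


def split_update_by_energy(w_pre, w_ft, energy=0.9):
    """Split a layer's fine-tuning update into high- and low-energy parts.

    `energy` is a hypothetical cutoff on the cumulative squared singular
    values; the source does not state how the split point is chosen.
    """
    delta = w_ft - w_pre
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    total = float(np.sum(s ** 2))
    if total == 0.0:
        # No update at all: both components are zero.
        return np.zeros_like(delta), np.zeros_like(delta), 0
    cum = np.cumsum(s ** 2) / total
    # Smallest k whose leading singular directions reach the energy cutoff.
    k = min(int(np.searchsorted(cum, energy)) + 1, len(s))
    high = (u[:, :k] * s[:k]) @ vt[:k]
    low = delta - high
    return high, low, k


def recombine(w_pre, high, low, lam_high, lam_low):
    """Rebuild a layer from reweighted components (weights are assumptions)."""
    return w_pre + lam_high * high + lam_low * low


rng = np.random.default_rng(0)
w_pre = rng.normal(size=(8, 6))
w_ft = w_pre + 0.1 * rng.normal(size=(8, 6))
high, low, k = split_update_by_energy(w_pre, w_ft)
w_same = recombine(w_pre, high, low, 1.0, 1.0)  # recovers w_ft
```

With both weights set to 1 the fine-tuned layer is recovered exactly, and with both set to 0 the pre-trained layer is recovered, so any reweighting scheme of this form interpolates between the two along spectral directions.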
Peer Reviews
Decision·Submitted to ICLR 2026
The paper tackles a practical and well-motivated challenge: retaining OOD robustness without storing or training multiple fine-tuned checkpoints. Its formulation is conceptually elegant, connecting the empirical success of model soups to the internal spectral geometry of a single model. The SVD-based decomposition provides an interpretable view of fine-tuning dynamics, distinguishing high-energy task adaptation from low-energy robustness-preserving directions. The adaptive weighting rule is simp
1. **Limited novelty beyond existing SVD-based merging.** While the paper frames its contribution as extending model soups to a single-model setting, the actual main operation (SVD decomposition of the fine-tuning update followed by spectral weighting) closely parallels prior works (e.g., Task Singular Vectors, Model Merging with SVD). The paper’s novelty primarily lies in its interpretation rather than in a fundamentally new algorithmic principle.
2. **Heuristic coefficient design without theo
The reviewer notes the following strengths:
- The paper presents a clear context for MonoSoup, with a defined motivation for the development of the underlying methodology.
- The proposed methodology (MonoSoup) is lightweight and readily applicable to real-world settings.
- MonoSoup showcases strong empirical performance across multiple models and tasks.
- The authors also provide strong intuitive background for MonoSoup through analysis, linking performance improvements to alignment between fine-tu
The reviewer notes the following weaknesses:
- The reviewer’s primary concern is that, while the paper’s motivation is clearly stated, the argument that storing only a single best-performing checkpoint necessitates the development of MonoSoup is unconvincing. In particular, the reviewer finds it unlikely that, in practice, there would be meaningful constraints on retaining multiple checkpoints during model training.
- Additional evaluations on other modalities like audio would also provide even
1. The paper addresses a practical and realistic scenario. While many merging methods like Model Soups require access to dozens of checkpoints, real-world applications often only store a single, best-performing model. The paper's focus on improving robustness from this "single-model" setting is a valuable contribution.
2. The core idea of using SVD to decompose the fine-tuning update into high-energy (specialization) and low-energy (robustness) components is interesting.
1. The paper's method for calculating the alignment coefficient $\cos \alpha^{(l)}$ (lines 264-269) is theoretically unclear and appears to be a significant conceptual leap. The method computes the cosine similarity between two conceptually different entities: the pre-trained weights $W_0^l$ (an absolute state vector) and the low-energy update $W_{\text{low}}^l$ (a difference vector). There is no clear justification for why the alignment between a 'state' and a 'difference' is a meaningful measure of kn
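For concreteness, the quantity this review questions can be sketched as follows. This sketch assumes the cosine is taken between the two matrices viewed as flattened vectors (the Frobenius inner product), which is one natural reading but is not confirmed by the text available here. The function name `frobenius_cosine` and the `eps` guard are hypothetical.

```python
import numpy as np


def frobenius_cosine(a, b, eps=1e-12):
    """Cosine of the angle between two same-shaped matrices.

    Treats each matrix as a flattened vector. This is an assumed reading of
    the alignment coefficient, not a confirmed definition from the paper.
    """
    num = float(np.sum(a * b))
    den = float(np.linalg.norm(a) * np.linalg.norm(b))
    return num / max(den, eps)  # eps guards against an all-zero matrix


w0 = np.array([[1.0, 0.0], [0.0, 1.0]])      # stand-in for pre-trained weights
w_low = np.array([[0.0, 2.0], [-2.0, 0.0]])  # stand-in for a low-energy update
alignment = frobenius_cosine(w0, w_low)      # these two are orthogonal
```

Under this reading the coefficient is invariant to the scale of either matrix, so it measures only the direction of the low-energy update relative to the pre-trained weights, which is the 'state' versus 'difference' comparison the reviewer asks the authors to justify.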
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
