ATM: Improving Model Merging by Alternating Tuning and Merging
Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Fabrizio Silvestri, Emanuele Rodolà

TL;DR
This paper introduces ATM, a method that alternates between tuning and merging models, providing a theoretical basis for task vectors and improving model merging efficiency in various settings.
Contribution
It offers a theoretical motivation for task vectors and proposes ATM, a novel iterative approach that enhances model merging and multitask learning applications.
Findings
ATM improves model merging performance across vision tasks.
Under single-epoch full-batch gradient descent, task vectors are equivalent to multitask gradients.
ATM serves as an effective refinement step for existing merging methods.
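The equivalence between task vectors and multitask gradients can be sketched in a few lines. The notation here is an assumption for illustration and is not taken from this page: \theta_0 is the shared base model, \eta the fine-tuning learning rate, L_t the loss of task t, \tau_t its task vector, and \alpha the merging coefficient. With a single full-batch gradient step per task:

```latex
\begin{align*}
\theta_t &= \theta_0 - \eta \nabla L_t(\theta_0)
  && \text{(one full-batch step on task } t\text{)} \\
\tau_t &= \theta_t - \theta_0 = -\eta \nabla L_t(\theta_0)
  && \text{(task vector)} \\
\theta_0 + \alpha \sum_{t=1}^{T} \tau_t
  &= \theta_0 - \alpha \eta \, \nabla \Big( \sum_{t=1}^{T} L_t \Big)(\theta_0)
  && \text{(task arithmetic)}
\end{align*}
```

Under these assumptions, merging by task arithmetic is one gradient descent step on the summed multitask loss with step size \alpha\eta. With more than one step per task the identity is no longer exact.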
Abstract
Model merging has emerged as a cost-efficient approximation to multitask learning. Among merging strategies, task arithmetic is notable for its simplicity and effectiveness. In this work, we provide a theoretical motivation for task vectors by highlighting that, under single-epoch full-batch gradient descent, they are equivalent to multitask gradients. This insight leads us to reinterpret model merging as a single step in an iterative procedure that Alternates between Tuning and Merging (ATM). We propose two applications of ATM: (1) as an alternative to multitask learning in scenarios where data sharing is restricted (e.g., federated settings), and (2) as a lightweight refinement step to improve existing model merging methods using a small validation set. Experiments across diverse vision tasks demonstrate the effectiveness of ATM.
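The alternating procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes toy least-squares tasks in NumPy instead of vision models, and the function names (`finetune`, `merge`, `atm`), the synthetic data, and the hyperparameters are all hypothetical. It tunes each task separately from a shared base, merges with task arithmetic, and uses the merged model as the base for the next round.

```python
import numpy as np

def grad(theta, X, y):
    """Full-batch gradient of the loss 0.5 * mean((X @ theta - y) ** 2)."""
    return X.T @ (X @ theta - y) / len(y)

def finetune(theta, X, y, lr, steps):
    """Plain full-batch gradient descent on one task, starting from theta."""
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * grad(theta, X, y)
    return theta

def merge(base, finetuned, alpha):
    """Task arithmetic: add the scaled sum of task vectors to the base."""
    task_vectors = [ft - base for ft in finetuned]
    return base + alpha * sum(task_vectors)

def atm(base, tasks, lr, steps, alpha, rounds):
    """Alternate tuning and merging; each merged model seeds the next round."""
    for _ in range(rounds):
        finetuned = [finetune(base, X, y, lr, steps) for X, y in tasks]
        base = merge(base, finetuned, alpha)
    return base

def multitask_loss(theta, tasks):
    """Sum of the per-task losses (the toy multitask objective)."""
    return sum(0.5 * np.mean((X @ theta - y) ** 2) for X, y in tasks)

# Synthetic tasks (assumption): related linear targets with small noise.
rng = np.random.default_rng(0)
dim, n_tasks = 5, 3
w_shared = rng.normal(size=dim)
tasks = []
for _ in range(n_tasks):
    X = rng.normal(size=(40, dim))
    w_task = w_shared + 0.3 * rng.normal(size=dim)
    tasks.append((X, X @ w_task + 0.1 * rng.normal(size=40)))

theta0 = np.zeros(dim)
lr, alpha = 0.05, 1.0 / n_tasks

# One tuning step per task followed by one merge coincides with one
# gradient step on the summed loss with step size alpha * lr.
one_round = atm(theta0, tasks, lr, steps=1, alpha=alpha, rounds=1)
mt_step = theta0 - alpha * lr * sum(grad(theta0, X, y) for X, y in tasks)
print("matches multitask step:", np.allclose(one_round, mt_step))

# Same total tuning budget, spent once or split across merging rounds.
one_shot = atm(theta0, tasks, lr, steps=50, alpha=alpha, rounds=1)
alternating = atm(theta0, tasks, lr, steps=5, alpha=alpha, rounds=10)
print("base loss:       ", multitask_loss(theta0, tasks))
print("one-shot loss:   ", multitask_loss(one_shot, tasks))
print("alternating loss:", multitask_loss(alternating, tasks))
```

Each call to `finetune` touches only one task's data, which is the property that makes the restricted data-sharing setting from the abstract plausible. The losses printed for this toy problem say nothing about the paper's reported results.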
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
The proposed method (ATM) is fairly simple, but is shown to have significantly higher performance over “one-shot” baselines and methods, where models are fine-tuned once and then merged. This demonstrates the value of merging models intermittently, rather than training models fully in a task-specific manner and then merging them. The paper also has some interesting analysis regarding the gradients of individual tasks over the course of training, which can help explain the efficacy of ATM and he
My understanding is that model merging is primarily a technique for taking advantage of models which are fine-tuned separately from one another, e.g. combining separate open-source models trained by different individuals. In other words, model merging is a technique to construct multi-task models when multi-task data is not available, i.e. when a single joint model cannot be trained. The proposed method, however, assumes that we have access to the entire multi-task data at one time, in order to
The proposed method has a GPU memory advantage compared to gradient-balancing multi-task learning approaches, since the fine-tuning for each task can be performed independently.
The paper has many flaws, from poor placement in the literature, misconfiguring baselines and ignoring the most relevant ones, overstating the results and representing known results as novel. Specifically: 1. The method is wrongly presented as model merging, while it is more closely related to joint fine-tuning. Imo, it cannot be seen as model merging, since the task combination is known a priori. The method is inflexible and defeats the purpose of model merging 2. Moreover, there is not a sin
1. The paper addresses model merging, an emerging area of high relevance for reducing fine-tuning costs in multi-task settings. 2. The initial theoretical insights linking task vectors with gradients of the loss for the corresponding tasks after the first gradient descent step are novel.
1. **Misalignment with Model Merging Goals**. The paper appears to overlook the key aims of model merging. The motivation for merging models fine-tuned on different tasks is not only saving memory – as the introduction of the paper claims – but instead bypassing expensive joint fine-tuning by enabling the combination of models fine-tuned *independently* on separate tasks. Indeed, model merging methods often rely on available checkpoints from online hubs. Contrary to this goal, ATM’s merging proc
1. **Theoretical Support**: The authors use proven theories in some discussions to support their viewpoints. 2. **Comprehensive Ablation Study**: There is discussion and experimental verification of various aspects of the proposed method.
1. **Scope of the Method and Application Scenario**: The proposed method requires original training data (at least validation data) and training, whereas the baselines (including Task Arithmetic, TIES, DARE, etc.) used for comparison are in fact data-free and training-free merging methods, allowing them to directly merge existing models. The requirements for data and training are obligatory for the authors’ method, thus making the application scenarios quite different. The authors should compare
Taxonomy
Topics: Semantic Web and Ontologies · Service-Oriented Architecture and Web Services · Natural Language Processing Techniques
