ATM: Improving Model Merging by Alternating Tuning and Merging

Luca Zhou; Daniele Solombrino; Donato Crisostomi; Maria Sofia Bucarelli; Fabrizio Silvestri; Emanuele Rodolà

arXiv:2411.03055·cs.LG·August 11, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces ATM, a method that alternates between tuning and merging models. It gives a theoretical motivation for task vectors and reports improved merging performance across diverse vision tasks.

Contribution

The paper offers a theoretical motivation for task vectors and proposes ATM, an iterative procedure that can stand in for multitask learning when data sharing is restricted and can refine the output of existing merging methods.

Findings

01

ATM improves model merging performance across vision tasks.

02

Under single-epoch full-batch gradient descent, task vectors are equivalent to multitask gradients.

03

ATM serves as an effective refinement step for existing merging methods.

Abstract

Model merging has emerged as a cost-efficient approximation to multitask learning. Among merging strategies, task arithmetic is notable for its simplicity and effectiveness. In this work, we provide a theoretical motivation for task vectors by highlighting that, under single-epoch full-batch gradient descent, they are equivalent to multitask gradients. This insight leads us to reinterpret model merging as a single step in an iterative procedure that Alternates between Tuning and Merging (ATM). We propose two applications of ATM: (1) as an alternative to multitask learning in scenarios where data sharing is restricted (e.g., federated settings), and (2) as a lightweight refinement step to improve existing model merging methods using a small validation set. Experiments across diverse vision tasks demonstrate the effectiveness of ATM.
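The equivalence stated in the abstract can be checked on a toy problem. The sketch below is a minimal illustration, not the authors' implementation: it assumes quadratic per-task losses L_t(θ) = ½‖θ − c_t‖², plain Python lists as parameters, and hypothetical helper names (`grad`, `finetune`, `task_vector`, `merge`, `atm`, `multitask_step`). With one full-batch step per task at learning rate η, each task vector equals −η∇L_t(θ₀), so task arithmetic with scaling α = 1/T should match one gradient step on the averaged multitask loss. Repeating the tune-then-merge round gives the alternating procedure.

```python
# Toy sketch of alternating tuning and merging on quadratic losses.
# Assumption: L_t(theta) = 0.5 * ||theta - c_t||^2 for each task t.
# All function names are hypothetical and chosen for illustration only.

def grad(theta, center):
    """Full-batch gradient of 0.5 * ||theta - center||^2."""
    return [p - c for p, c in zip(theta, center)]

def finetune(theta, center, lr, steps):
    """Run `steps` gradient descent steps on a single task."""
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta, center)
        theta = [p - lr * gi for p, gi in zip(theta, g)]
    return theta

def task_vector(finetuned, base):
    """Task vector: fine-tuned parameters minus the shared base."""
    return [f - b for f, b in zip(finetuned, base)]

def merge(base, task_vectors, alpha):
    """Task arithmetic: base + alpha * (sum of task vectors)."""
    summed = [sum(col) for col in zip(*task_vectors)]
    return [b + alpha * s for b, s in zip(base, summed)]

def atm(base, centers, lr, steps, alpha, rounds):
    """Alternate per-task tuning and merging for `rounds` rounds."""
    theta = list(base)
    for _ in range(rounds):
        vectors = [
            task_vector(finetune(theta, c, lr, steps), theta)
            for c in centers
        ]
        theta = merge(theta, vectors, alpha)
    return theta

def multitask_step(theta, centers, lr):
    """One gradient step on the average of the task losses."""
    grads = [grad(theta, c) for c in centers]
    avg = [sum(col) / len(centers) for col in zip(*grads)]
    return [p - lr * g for p, g in zip(theta, avg)]

centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # one minimizer per task
base = [0.0, 0.0]

# One round with a single step per task vs. one joint multitask step.
one_round = atm(base, centers, lr=0.1, steps=1,
                alpha=1 / len(centers), rounds=1)
joint = multitask_step(base, centers, lr=0.1)
print(one_round, joint)
```

On this toy problem the single-round merge and the joint step should agree up to floating-point error, and further rounds move the merged parameters toward the minimizer of the average loss (here, the mean of the task centers). The scaling `alpha = 1 / T` and the quadratic losses are choices made for the illustration only.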

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The proposed method (ATM) is fairly simple, but is shown to have significantly higher performance over “one-shot” baselines and methods, where models are fine-tuned once and then merged. This demonstrates the value of merging models intermittently, rather than training models fully in a task-specific manner and then merging them. The paper also has some interesting analysis regarding the gradients of individual tasks over the course of training, which can help explain the efficacy of ATM and he

Weaknesses

My understanding is that model merging is primarily a technique for taking advantage of models which are fine-tuned separately from one another, e.g. combining separate open-source models trained by different individuals. In other words, model merging is a technique to construct multi-task models when multi-task data is not available, i.e. when a single joint model cannot be trained. The proposed method, however, assumes that we have access to the entire multi-task data at one time, in order to

Reviewer 02 · Rating 1 · Confidence 5

Strengths

The proposed method has a GPU memory advantage compared to gradient-balancing multi-task learning approaches, since the fine-tuning for each task can be performed independently.

Weaknesses

The paper has many flaws, from poor placement in the literature, misconfiguring baselines and ignoring the most relevant ones, overstating the results and representing known results as novel. Specifically: 1. The method is wrongly presented as model merging, while it is more closely related to joint fine-tuning. Imo, it cannot be seen as model merging, since the task combination is known a priori. The method is inflexible and defeats the purpose of model merging 2. Moreover, there is not a sin

Reviewer 03 · Rating 1 · Confidence 5

Strengths

1. The paper addresses model merging, an emerging area of high relevance for reducing fine-tuning costs in multi-task settings. 2. The initial theoretical insights linking task vectors with gradients of the loss for the corresponding tasks after the first gradient descent step are novel.

Weaknesses

1. **Misalignment with Model Merging Goals**. The paper appears to overlook the key aims of model merging. The motivation for merging models fine-tuned on different tasks is not only saving memory – as the introduction of the paper claims – but instead bypassing expensive joint fine-tuning by enabling the combination of models fine-tuned *independently* on separate tasks. Indeed, model merging methods often rely on available checkpoints from online hubs. Contrary to this goal, ATM’s merging proc

Reviewer 04 · Rating 5 · Confidence 4

Strengths

1. **Theoretical Support**: The authors use proven theories in some discussions to support their viewpoints. 2. **Comprehensive Ablation Study**: There is discussion and experimental verification of various aspects of the proposed method.

Weaknesses

1. **Scope of the Method and Application Scenario**: The proposed method requires original training data (at least validation data) and training, whereas the baselines (including Task Arithmetic, TIES, DARE, etc.) used for comparison are in fact data-free and training-free merging methods, allowing them to directly merge existing models. The requirements for data and training are obligatory for the authors’ method, thus making the application scenarios quite different. The authors should compare

Taxonomy

Topics: Semantic Web and Ontologies · Service-Oriented Architecture and Web Services · Natural Language Processing Techniques