Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport

Zecheng Pan; Zhikang Chen; Ding Li; Min Zhang; Sen Cui; Hongshuo Jin; Luqi Tao; Yi Yang; Deheng Ye; Yu Zhang; Tingting Zhu; Tianling Ren

arXiv:2511.19561·cs.LG·November 26, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces OTMF, a novel optimal transport-based framework for merging task-specific models that preserves task knowledge and improves efficiency, outperforming existing parameter interpolation methods.

Contribution

OTMF leverages optimal transport to align semantic features, enabling scalable, continual fusion of models without revisiting previous tasks.
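The continual setting described here can be illustrated with a minimal sketch. It assumes models are flat NumPy weight vectors and uses a plain running mean of task vectors as a stand-in update; it is not OTMF's mask-based rule, which this page does not specify. The name `continual_merge` is hypothetical.

```python
import numpy as np

def continual_merge(pretrained, task_models):
    """Fold task-specific models into one merged model, one at a time.

    Only the current merged weights and the incoming task model are held,
    so memory does not grow with the number of tasks. The update is a
    running mean of task vectors (assumption: a simple baseline, not OTMF).
    """
    merged = pretrained.copy()
    for t, theta in enumerate(task_models, start=1):
        task_vector = theta - pretrained      # what the new task changed
        merged_vector = merged - pretrained   # what has been merged so far
        # Incremental mean: earlier task models are never revisited.
        merged = pretrained + merged_vector + (task_vector - merged_vector) / t
    return merged
```

Because `task_models` can be any iterable, including a generator that loads one checkpoint at a time, earlier task models never need to be kept in memory.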

Findings

1. OTMF achieves state-of-the-art accuracy on vision and language benchmarks.
2. OTMF maintains a bounded memory footprint during continual fusion.
3. OTMF outperforms traditional parameter interpolation methods.

Abstract

Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique…
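The abstract refers to optimal transport plans between feature distributions, and the reviews below mention a Sinkhorn loss. As a rough illustration of that kind of quantity, the sketch below computes an entropy-regularized transport plan and the resulting transport cost between two sets of feature vectors. The function name `sinkhorn`, the uniform sample weights, the squared Euclidean cost, and the relative `eps` are all assumptions; how OTMF derives masks from such plans is not covered in the excerpt and is not attempted here.

```python
import numpy as np

def sinkhorn(x, y, eps=0.5, n_iters=200):
    """Entropic OT between two feature sets with uniform sample weights.

    x: (n, d) array, y: (m, d) array. Returns (plan, distance).
    eps is expressed relative to the largest pairwise cost (assumption).
    """
    # Squared Euclidean cost between every pair of feature vectors.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    scale = max(cost.max(), 1e-12)            # avoid division by zero
    kernel = np.exp(-cost / (eps * scale))    # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))         # uniform source weights
    b = np.full(len(y), 1.0 / len(y))         # uniform target weights
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return plan, float((plan * cost).sum())
```

In a merging context, `x` and `y` could be features of the same inputs taken from a task-specific model and from the merged model, so that a larger returned distance indicates a larger feature-space shift.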

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The use of distribution alignment with optimal transport is new in the context of model merging.
- Results show improvements over several baselines on vision (results on language, Table 3, are significantly weaker).

Weaknesses

- The presentation of the paper could be much improved. The authors have included all the components of their method in Sections 3 and 4, but much of the work of connecting them is left to the reader (using, for example, the algorithm). The actual use of the loss described in Section 3.2 is not discussed in the context of optimizing the masks introduced later in Section 4.2. The alternating nature of the mask optimization is mentioned only in Algorithm 1, but it should also be discussed explicitly in t

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The use of the Sinkhorn distance in the context of continual model merging is novel.
- Extensive comparisons on both vision and language tasks demonstrate that the proposed method achieves good performance.

Weaknesses

1) The paper is poorly structured and contains numerous grammatical and spelling errors. Details follow:
- The narrative of the paper is difficult to follow, which makes it hard to understand how the proposed methodology works. Section 3.3 (within the Preliminaries) already describes part of the overall framework and should therefore be merged with Section 4.1. In its current form, the presentation is fragmented: the Sinkhorn loss from optimal tr

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The core idea of using optimal transport to guide learnable masks for aligning pre- and post-task distributions is conceptually interesting and addresses a real limitation of parameter-space merging methods (e.g., catastrophic forgetting due to distribution shifts, especially under the continual model merging setting, where previous task-specific models are not available).
2. The continual fusion paradigm, which only requires the current merged model and the new task model, is practical for scalab

Weaknesses

1. A significant concern arises from OTMF's reliance on **labeled data** for fine-tuning the classification heads after each merging step. The use of labeled data is atypical in the model merging literature: techniques such as Task Arithmetic, TIES-Merging, and OPCM are data-free, and AdaMerging uses only unlabeled data for test-time adaptation.
2. The performance of the merged model without classification-head tuning should also be listed in Table 1.
3. Only high-level results such as avera

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications