Model Merging with Functional Dual Anchors
Kexuan Shi, Yandong Wen, Weiyang Liu

TL;DR
This paper introduces Functional Dual Anchors (FDAs), a novel framework for model merging that models input-representation space instead of parameters, improving robustness and flexibility in integrating multiple finetuned models.
Contribution
FDAs provide a new input-based approach to model merging, bridging multi-task training and post-hoc merging, with a principled initialization scheme and demonstrated effectiveness.
Findings
FDAs outperform parameter-space methods in model merging tasks.
FDAs are complementary to existing parameter-based merging techniques.
Experiments show improved robustness and flexibility in model integration.
Abstract
Model merging is an efficient post-training strategy for integrating knowledge from multiple finetuned checkpoints of a shared foundation model. Existing methods operate in the parameter space, combining task vectors to mitigate conflicts, but remain constrained by parameter inconsistencies. We propose Functional Dual Anchors (FDAs), a framework that instead models the input-representation space. FDAs are synthetic inputs whose induced gradients align with task vectors, capturing task-specific functional shifts relative to the pretrained model. This perspective bridges joint multi-task training and post-hoc merging, offering both robustness and flexibility. We further introduce a principled initialization scheme and show that FDAs are complementary to parameter-space model merging. Comprehensive experiments demonstrate the effectiveness of FDAs in model merging.
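To make the phrase "synthetic inputs whose induced gradients align with task vectors" concrete, here is a minimal sketch under strong simplifying assumptions: a single linear layer, a squared-error discrepancy between the pretrained and finetuned outputs, and plain gradient ascent with step halving. The paper's actual objective, architecture, and optimizer may differ, and the names `alignment`, `alignment_grad`, and `fit_anchors` are hypothetical. For a linear layer `f(W, x) = W x` with task vector `tau = W_i - W_0`, the negative gradient of `0.5 * sum_j ||W x_j - W_i x_j||^2` at `W = W_0` is `tau @ (X.T @ X)`, so the anchors `X` can be tuned until that direction matches `tau`.

```python
import numpy as np

def alignment(X, tau):
    """Cosine similarity between the negative induced gradient and tau.

    Assumption: linear layer with squared-error discrepancy, so the negative
    gradient at the pretrained weights is tau @ (X.T @ X).
    """
    neg_grad = tau @ (X.T @ X)
    denom = np.linalg.norm(neg_grad) * np.linalg.norm(tau)
    return float(np.sum(neg_grad * tau) / denom)

def alignment_grad(X, tau):
    """Analytic gradient of alignment(X, tau) with respect to the anchors X."""
    A = tau @ (X.T @ X)                      # negative induced gradient
    a, t = np.linalg.norm(A), np.linalg.norm(tau)
    c = np.sum(A * tau) / (a * t)
    dA = tau / (a * t) - c * A / a**2        # d(cosine) / dA
    M = tau.T @ dA                           # d(cosine) / d(X.T @ X)
    return X @ (M + M.T)

def fit_anchors(tau, n_anchors=16, steps=200, lr=1.0, seed=0):
    """Gradient ascent on the alignment; a step is kept only if it does not hurt."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_anchors, tau.shape[1]))
    start = best = alignment(X, tau)
    for _ in range(steps):
        candidate = X + lr * alignment_grad(X, tau)
        score = alignment(candidate, tau)
        if score >= best:
            X, best = candidate, score
        else:
            lr *= 0.5                        # step was too large, shrink it
    return X, start, best

# Toy setup (illustrative values only): pretrained and finetuned weights.
rng = np.random.default_rng(1)
W0 = rng.standard_normal((4, 8))
Wi = W0 + 0.1 * rng.standard_normal((4, 8))
tau = Wi - W0                                # task vector
X, start, best = fit_anchors(tau)
print(f"alignment before: {start:.4f}, after: {best:.4f}")
```

In this toy setting the anchors replace the task vector as the carrier of task information: a model trained on `X` from the pretrained weights would initially move in a direction close to `tau`. For deep networks the induced gradient has no closed form, which is where the nested optimization and second-order gradients mentioned in the reviews come in.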
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper provides a novel and insightful perspective on model merging. 2. The proposed method is well-motivated by a solid theoretical analysis. 3. The experimental validation is comprehensive and robust, convincingly demonstrating FDA's effectiveness.
1. The FDA construction process involves a nested optimization problem that requires computing second-order gradients, leading to a significant computational overhead. Although the layer-wise strategy makes it tractable, the method is inherently more expensive than one-shot approaches like TA or WUDI, and the paper lacks a quantitative analysis of this extra cost, which raises concerns about its practical utility. 2. While FDAs show the ability to enhance existing methods, the performance gains
1) New approach to merging: Proposes a method that projects task knowledge into the input–representation space rather than directly manipulating parameter vectors. This connects joint multi-task training (input-centric) with post-hoc weight averaging (parameter-centric), providing an alternative design perspective. 2) Theoretically grounded initialisation: Derives closed-form dynamics for a linear encoder and shows that tail eigen-energy of the task vector slows convergence. Two simple initialisat
1) The motivation is unclear, and the paper lacks a clear explanation of why input representations can be used to replace task vectors in model merging. 2) Limited theoretical justification beyond linear case: All convergence claims rest on a single-layer linear encoder (Sec. 2.2); no analysis for non-convex deep nets or layer interactions. No guarantee that gradient-aligned synthetic points transfer to real-data loss basins. 3) Missing statistical significance: Reported numbers are single-run mean
This paper provides a new way to understand the merging process. By reinterpreting task vectors as gradients induced by synthetic inputs, FDAs bridge the gap between multi-task learning and post-hoc model merging, offering a new functional perspective for knowledge consolidation. The discussions are sound and insightful.
1. In lines 55-56, the authors claimed that 'we shift the merging process into the input space, where representations can naturally capture task-specific variations.' Why can the 'merging process in the input space' 'naturally' capture task-specific variations? 2. The whole pipeline seems to be computationally intensive. What's the exact learning time? Compared to baselines, will the performance increase be enough to offset the increase in computing complexity? 3. In line 53, the authors
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
