From Coefficients to Directions: Rethinking Model Merging with Directional Alignment
Zhikang Chen, Sen Cui, Deheng Ye, Min Zhang, Gang Niu, Yu Zhang, Masashi Sugiyama, Tingting Zhu

TL;DR
This paper introduces a novel model merging method that emphasizes aligning directional structures in parameter and feature spaces, improving coherence and performance over traditional coefficient-based approaches.
Contribution
It proposes a unified geometric framework called Merging with Directional Alignment that enhances model merging by focusing on directional consistency.
Findings
Directional alignment improves structural coherence.
The method outperforms traditional merging strategies.
Extensive experiments validate effectiveness across tasks.
Abstract
Model merging has emerged as a practical paradigm for integrating multiple independently trained models into a single model without joint retraining. Previous studies have demonstrated the effectiveness of combining parameters through strategies such as parameter decomposition, coefficient optimization, and subspace learning, significantly reducing the need for expensive joint training and achieving strong empirical performance across diverse tasks. However, these approaches predominantly treat merging as a problem of parameter space decomposition or fusion coefficient optimization, while overlooking the critical role of directional information in both parameter and feature spaces. In practice, naive merging introduces inconsistencies in dominant parameter directions and disrupts structural coherence across models, which can degrade performance. Moreover, coefficient-based…
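To make the contrast between coefficients and directions concrete, here is a minimal sketch. It is not the paper's method: it uses a plain weighted sum of task vectors (task-arithmetic-style merging) as a stand-in for coefficient-based approaches, and the helper names (`task_vector`, `merge`, `cosine`) and toy weights are assumptions chosen for illustration.

```python
import math

def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pretrained weights (flattened lists)."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def merge(pretrained, task_vectors, coeffs):
    """Coefficient-based merge: pretrained + sum_i coeffs[i] * task_vectors[i]."""
    merged = list(pretrained)
    for tau, c in zip(task_vectors, coeffs):
        merged = [m + c * t for m, t in zip(merged, tau)]
    return merged

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy weights (hypothetical): two fine-tuned models that partly oppose each other.
pretrained = [0.0, 0.0, 0.0]
model_a = [1.0, 0.0, 0.0]
model_b = [-1.0, 1.0, 0.0]

taus = [task_vector(model_a, pretrained), task_vector(model_b, pretrained)]
merged = merge(pretrained, taus, [0.5, 0.5])
cos_ab = cosine(taus[0], taus[1])

print("merged weights:", merged)
print("cosine(tau_a, tau_b):", round(cos_ab, 4))
```

In this toy case the two task vectors point in partly opposite directions, so the first coordinate cancels in the merged weights. Rescaling either task vector by a positive coefficient leaves the angle between them unchanged, which is the kind of directional conflict the abstract says coefficient-based merging overlooks.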
Peer Reviews
Decision·Submitted to ICLR 2026
- **Well-motivated perspective:** This paper formulates model merging as a problem of geometric alignment (direction) rather than just coefficient optimization (magnitude). Moreover, the paper provides a solid theoretical grounding for why directional alignment should matter by the connection to neural collapse and the simplex ETF.
- **Comprehensive evaluation and theoretical analysis:** The experiment is well-designed and comprehensive. Its evaluation spans multiple architectures, diverse tas
- **Questionable assumption about neural collapse in merged models:** The paper assumes that multi-class joint training induces simplex ETF geometry and **that merged models should approximate this structure**. Why? In fact, merged models fundamentally differ from jointly trained models: they aggregate task-specific adaptations rather than training jointly from scratch. No evidence is provided showing that merged models actually violate ETF structure or that imposing ETF structure better approxi
- The idea of explicitly enforcing directional alignment in model merging is original in its formulation and integration of neural collapse geometry into the merging framework.
- The empirical evaluation is extensive, spanning 8–20 task vision benchmarks and GLUE-style NLP datasets. The improvements, though moderate (typically +0.5–2%), are consistent and show that directional alignment has a measurable effect. The paper provides clear ablations for the loss weights and module contributions, sh
- Despite focusing on geometry, the paper lacks quantitative or visual analysis of directional statistics: e.g., angular distributions between task vectors before/after alignment, cosine similarity matrices, or CKA correlation trends.
- The distinction between MDA and prior geometric methods (TSV, ISO, DOGE) remains unclear. Many of these already manipulate task subspaces or singular vector orientations, essentially performing partial directional alignment.
- The feature-space optimization involv
- The method obtains good results on standard vision model merging benchmark datasets, including good generalization.
- The gain of the method increases in the more difficult 20-task scenario and for larger models (ViT-L/14).
- I found the presentation lacking. The paper is very hard to follow, and I often found it unclear and incorrect. (E.g., the introduction of neural collapse (Section 3.2) needs motivation. At the start of Section 4, the main motivation claims that other methods (Gargiulo, Wei, Marczak) are based on task-specific subspaces, which are claimed to be computationally expensive. They do not only focus on task-specific subspaces, and the paper lacks a computational comparison, etc.)
- The results of
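Several of the reviews above turn on whether merged models should approximate simplex ETF geometry. For reference, a simplex ETF of K vectors has unit norms and a pairwise cosine of exactly -1/(K-1). The sketch below builds one and measures how far a set of vectors deviates from that target, which is one simple form of the directional diagnostic the reviewers ask for. It is an illustration only, not code from the paper; `simplex_etf` and `etf_deviation` are hypothetical helper names.

```python
import math

def simplex_etf(k):
    """Rows of sqrt(k/(k-1)) * (I - (1/k) * ones): k unit vectors with equal pairwise angles."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def etf_deviation(vectors):
    """Largest gap between any pairwise cosine and the ETF target -1/(k-1)."""
    k = len(vectors)
    target = -1.0 / (k - 1)
    return max(
        abs(cosine(vectors[i], vectors[j]) - target)
        for i in range(k)
        for j in range(i + 1, k)
    )

etf = simplex_etf(4)
# Mutually orthogonal vectors, used here as a non-ETF comparison.
orthogonal = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print("ETF deviation:", etf_deviation(etf))
print("orthogonal deviation:", etf_deviation(orthogonal))
```

Applied to class-mean features (after centering) or to classifier rows of a merged model, a deviation near zero would indicate ETF-like structure. Whether merged models should reach that structure is exactly what the second review questions.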
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis
