An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse
Yuan Cao, Dezhi Ran, Yuzhe Guo, Mengzhou Wu, Simin Chen, Linyi Li, Wei Yang, Tao Xie

TL;DR
This paper investigates why merging independently fine-tuned language models sometimes fails catastrophically, revealing that representational incompatibility causes collapse and providing a theoretical framework to understand these limits.
Contribution
It identifies task-level merging collapse as a key failure mode, characterizes its causes, and offers a theoretical explanation based on rate-distortion theory.
Findings
Representational incompatibility correlates with merging collapse.
Parameter-space conflict metrics show minimal correlation with merging collapse.
Theoretical bounds on task mergeability are established.
Abstract
Model merging unifies independently fine-tuned LLMs from the same base, enabling reuse and integration of parallel development efforts without retraining. However, in practice we observe that merging does not always succeed: certain combinations of task-specialist models suffer from catastrophic performance degradation after merging. We refer to this failure mode as merging collapse. Intuitively, collapse arises when the learned representations or parameter adjustments for different tasks are fundamentally incompatible, so that merging forces destructive interference rather than synergy. In this paper, we identify and characterize the phenomenon of task-level merging collapse, where certain task combinations consistently trigger huge performance degradation across all merging methods. Through extensive experiments and statistical analysis, we demonstrate that representational…
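To make the merging operation in the abstract concrete, the following is a minimal sketch of the simplest case: a convex combination (weighted average) of parameters from models fine-tuned from the same base. It is an illustration under stated assumptions, not the paper's procedure: the function name `convex_merge`, the dictionary-of-lists parameter format, and the toy values are all hypothetical, and the merging methods studied in the paper may differ substantially.

```python
from typing import Dict, List, Sequence

# Hypothetical parameter format: tensor name -> flat list of floats.
Params = Dict[str, List[float]]


def convex_merge(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Merge models by a convex combination of their parameters.

    Assumes every model shares the same tensor names and shapes,
    which holds when all are fine-tuned from one base model.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    merged: Params = {}
    for name in models[0]:
        size = len(models[0][name])
        # Weighted sum of the i-th entry across all models.
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(size)
        ]
    return merged


# Toy example with two "task specialists" (illustrative values only).
model_a = {"layer.weight": [1.0, 2.0]}
model_b = {"layer.weight": [3.0, 6.0]}
merged = convex_merge([model_a, model_b], [0.5, 0.5])
print(merged)
```

Collapse, in the sense the abstract describes, would show up as the merged parameters performing far worse on each task than the corresponding specialist, which this sketch does not measure.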
Peer Reviews
Decision·Submitted to ICLR 2026
Strengths:
- Good study across merging methods and domains
- Concretely finds that task incompatibility causes merging collapse
- Proposes a hidden similarity metric for guiding merging
- The paper does not deliver the significant understanding of merging collapse that the abstract promises. It is already known that task incompatibility causes merging failure, as it cannot maintain linear mode connectivity.
- Merging is interesting from a generalization perspective, but the paper doesn’t study anything related to that. If the goal is only to retain the performance of existing models, then even past works have seen that it is not always possible to retain complete performance, i.e., merging collapse will happen.
- The
- The paper provides a well-structured empirical investigation of model merging failures. By systematically testing correlations between merging loss, task identity, and several parameter-space metrics, it offers strong evidence that task characteristics primarily drive performance degradation.
- The paper introduces a novel metric that connects merging performance to the geometry of model representations and correlates with performance degradation in merging.
- They provide a theoretical resu
- The paper does not engage with existing literature tackling the same issue from complementary angles. Works such as Task Singular Vectors [1] and Iso-C [2] directly analyze model merging collapse through task subspace geometry and rank alignment, yet are not cited or discussed. This omission limits the contextualization of the proposed study and leaves unclear how it connects to or extends prior understanding of merging failure.
- Evaluating the hidden-state diameter $\Delta$ or t
1. Breadth of evidence for collapse. Systematic experiments across multiple merging rules and backbones show the collapse phenomenon is widespread and primarily task-driven rather than method-driven.
2. Representation-space explanation that matches practice. A rate–distortion analysis yields a dimension-dependent lower bound on representation distortion for convex merges, providing a principled reason merges can fail and aligning with the empirical signal.
3. Actionable pre-screening tools. Hi
1. Theory relies on strong assumptions. The lower-bound analysis assumes linear-mode connectivity and last-layer linearity.
2. Narrow design of HiddenSim. It uses few samples and only the last layer with L2 distance. Including ablations over sample size, layer choice, pooling, and alternative metrics would help.
3. Results are centered on GLUE and Lots-of-LoRAs. The authors may include non-classification or code/generation tasks and additional model backbones to test generality.
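The reviews describe HiddenSim as using a few samples, the last layer only, and L2 distance, and they mention a hidden-state diameter $\Delta$. The sketch below shows one way such a pre-screening quantity could be computed; it is not the paper's definition. The names `mean_l2` and `hidden_diameter`, the choice to average per-sample distances, and the choice to take the maximum over model pairs are all assumptions made for illustration, and the toy vectors stand in for real last-layer hidden states.

```python
import math
from itertools import combinations
from typing import Dict, List

Vector = List[float]


def mean_l2(a: List[Vector], b: List[Vector]) -> float:
    """Average L2 distance between two models' hidden states.

    Assumes a[i] and b[i] are last-layer hidden vectors produced
    by the two models for the same probe input i.
    """
    if not a or len(a) != len(b):
        raise ValueError("need the same non-empty set of probe inputs")
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def hidden_diameter(states: Dict[str, List[Vector]]) -> float:
    """Largest pairwise mean-L2 distance across all models (assumed form)."""
    if len(states) < 2:
        raise ValueError("need at least two models")
    return max(
        mean_l2(states[p], states[q]) for p, q in combinations(states, 2)
    )


# Toy hidden states for three task specialists on two probe inputs.
states = {
    "task_a": [[0.0, 0.0], [1.0, 0.0]],
    "task_b": [[3.0, 4.0], [1.0, 0.0]],
    "task_c": [[0.0, 0.0], [1.0, 1.0]],
}
diameter = hidden_diameter(states)
print(diameter)
```

Under these assumptions, a larger diameter would flag a set of task models as a riskier merge. The ablations the reviewer asks for (sample size, layer choice, pooling, alternative metrics) correspond to varying the probe set, which layer's states are passed in, and the distance used in `mean_l2`.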
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research · Parallel Computing and Optimization Techniques
