Collective Model Intelligence Requires Compatible Specialization
Jyothish Pari, Samy Jelassi, Pulkit Agrawal

TL;DR
This paper investigates the limitations of model merging through averaging, introduces compatible specialization as a new approach for collective model intelligence, and emphasizes the importance of aligning input-output spaces for effective merging.
Contribution
It proposes compatible specialization and routing-based merging strategies to improve the integration of specialized models, addressing representational divergence issues.
Findings
Feature similarity decreases with specialization, hindering merging.
Routing-based strategies improve merging by combining multi-layer features.
Compatibility in input-output spaces is crucial for effective model merging.
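To make the contrast between averaging and routing concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the experts are single linear layers, the router is a plain softmax gate, and the names `average_merge`, `routed_merge`, and `router_W` are hypothetical. The paper's routing also combines features across multiple layers, which this single-layer sketch does not attempt.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def average_merge(x, experts):
    """Feature averaging: every expert contributes equally to every input."""
    return np.mean([x @ W for W in experts], axis=0)

def routed_merge(x, experts, router_W):
    """Routing: a gate chooses per-input mixing weights over the experts.

    router_W is a hypothetical gate matrix of shape (d_in, n_experts).
    """
    gates = softmax(x @ router_W)                      # (n, n_experts)
    outs = np.stack([x @ W for W in experts], axis=1)  # (n, n_experts, d_out)
    merged = np.einsum("ne,ned->nd", gates, outs)      # gate-weighted sum
    return merged, gates

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                            # 5 inputs, 8 features
experts = [rng.normal(size=(8, 4)) for _ in range(2)]  # two toy "specialists"

averaged = average_merge(x, experts)

# An all-zero gate gives uniform weights, so routing reduces to averaging.
routed_uniform, gates_uniform = routed_merge(x, experts, np.zeros((8, 2)))

# A non-trivial gate weights the experts differently for each input.
routed, gates = routed_merge(x, experts, rng.normal(size=(8, 2)))
```

With a uniform gate the two functions agree, which shows averaging as a special case of routing; a trained gate can instead favor whichever expert suits a given input.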
Abstract
In this work, we explore the limitations of combining models by averaging intermediate features, referred to as model merging, and propose a new direction for achieving collective model intelligence through what we call compatible specialization. Current methods for model merging, such as parameter and feature averaging, struggle to effectively combine specialized models due to representational divergence during fine-tuning. As models specialize to their individual domains, their internal feature representations become increasingly incompatible, leading to poor performance when attempting to merge them for new tasks. We analyze this phenomenon using centered kernel alignment (CKA) and show that as models specialize, the similarity in their feature space structure diminishes, hindering their capacity for collective use. To address these challenges, we investigate routing-based merging…
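The abstract measures representational divergence with centered kernel alignment (CKA). As a rough illustration, the sketch below computes the commonly used linear variant of CKA between two feature matrices; whether the paper uses the linear or a kernel variant is not stated in this summary, so treat that choice, the function name `linear_cka`, and the synthetic data as assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2).

    Rows are the same n inputs passed through two different models or layers.
    """
    # Center each feature dimension across the n examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))

# An orthogonal change of basis keeps the feature-space structure intact.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = base @ Q

# Independent random features share no structure with `base`.
unrelated = rng.normal(size=(200, 16))

cka_same = linear_cka(base, base)
cka_rotated = linear_cka(base, rotated)
cka_unrelated = linear_cka(base, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `cka_rotated` stays at 1 up to floating-point error, while unrelated features score much lower. This is the sense in which a falling CKA between fine-tuned models signals diverging feature-space structure.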
Peer Reviews
Decision·Submitted to ICLR 2025
- Insightful Identification of Compatibility Issues: The paper highlights that representational divergence during fine-tuning can hinder model merging efforts, which is a fundamental issue when merging any kind of machine learning models; hence the paper is well motivated.
- The paper is generally well-written and offers useful analogies for the reader to ground his or her intuitions correctly.
- The reader can smoothly go over the paper, as academic language develops every concept with eloq
- Arguing that models require **compatible specialization** is central to the purposes of this paper. Yet the concept remains ill-defined after the methodology section, which brings certain issues such as:
  - no direct/intuitive connection between routing mechanisms and a neural model's _ability to communicate its knowledge_;
  - lack of literature review on past methods that could, at least, somehow contribute to enhancing compatible specialization;
  - While Figure 6 showcases how layer inc
- **Originality**: This paper introduces a fresh perspective on model merging by emphasizing the importance of compatible specialization rather than direct feature merging. The application of centered kernel alignment (CKA) in this context is a novel approach that strengthens the paper’s analysis and arguments.
- **Clarity**: The paper is well-organized, with clear sections and helpful diagrams (e.g., Figures 1 and 4) that effectively illustrate the merging processes and highlight the limitat
- **Quality**: The paper’s experimental scope is limited, and expanding it to include a wider range of tasks, such as those in SuperGLUE [[1](https://arxiv.org/abs/1905.00537)], would strengthen its findings. Additionally, comparisons with other benchmark model merging of Mixture of Experts (MoE) methods, such as those presented in recent works [[2](https://arxiv.org/pdf/2406.09770), [3](https://arxiv.org/abs/2402.00433), [4](https://arxiv.org/pdf/2402.05859), and [5](https://arxiv.org/pdf/2306.
The proposed MoE-based approach introduces a fresh perspective to the challenge of model merging by routing between domain-specific models. This strategy could address issues with prior methods that rely on simple parameter interpolation, particularly when models are poorly aligned in terms of CKA.
1. The paper only uses cross-entropy (CE) loss to evaluate the merged model, which may be insufficient to capture the model’s true performance, particularly in language modeling tasks. CE loss alone may not fully reflect the merged model’s language modeling and generation quality. Following guidelines from prior work ([1]), it would be beneficial to include additional metrics that measure both the intrinsic quality of the model (e.g., perplexity and accuracy) and the quality of generated text (e
Taxonomy
Topics: Computability, Logic, AI Algorithms
