Scalable Model Merging with Progressive Layer-wise Distillation

Jing Xu; Jiazheng Li; Jingzhao Zhang

arXiv:2502.12706·cs.LG·May 28, 2025

TL;DR

This paper introduces ProDistill, a novel layer-wise distillation method for scalable model merging that significantly improves performance in few-shot scenarios across vision and NLU tasks, especially for large models.

Contribution

The paper presents ProDistill, a new layer-wise distillation algorithm that enhances model merging scalability and performance, supported by theoretical analysis and extensive experiments.

Findings

01. ProDistill outperforms existing few-shot merging methods, with improvements of up to 6.61%.

02. Layer-wise distillation improves model merging performance.

03. ProDistill scales effectively to models with over 10 billion parameters.

Abstract

Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation and introduce a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that compared to existing few-shot merging methods,…
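To make the idea of progressive layer-wise teacher-student distillation more concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the authors' implementation: every layer is a plain linear map followed by tanh, each task gets a single scalar merging coefficient per layer, and because the per-layer objective is then linear in those coefficients, it is solved with least squares rather than gradient descent. The names `distill_layer`, `layer_loss`, and `prodistill_sketch` are hypothetical, and the toy weights and inputs are random.

```python
import numpy as np


def layer_loss(lam, w0, taus, xs_student, ys_teacher):
    """Summed squared gap between merged-layer and teacher-layer outputs."""
    w = w0 + sum(c * t for c, t in zip(lam, taus))
    return float(sum(np.sum((x @ w.T - y) ** 2)
                     for x, y in zip(xs_student, ys_teacher)))


def distill_layer(w0, taus, xs_student, ys_teacher):
    """Fit one scalar coefficient per task for a single linear layer.

    The merged output x @ (w0 + sum_i lam_i * tau_i).T is linear in lam,
    so minimizing the squared gap is an ordinary least-squares problem.
    (Assumption: scalar coefficients; the paper may use a finer granularity.)
    """
    rows, targets = [], []
    for x, y in zip(xs_student, ys_teacher):
        # Column i holds the flattened contribution of task vector i.
        rows.append(np.stack([(x @ t.T).ravel() for t in taus], axis=1))
        targets.append((y - x @ w0.T).ravel())
    a = np.concatenate(rows, axis=0)
    b = np.concatenate(targets, axis=0)
    lam, *_ = np.linalg.lstsq(a, b, rcond=None)
    return lam


def prodistill_sketch(pretrained, finetuned, task_inputs):
    """Merge layer by layer, feeding each side its own activations forward.

    pretrained:  list of layer weight matrices (d_out x d_in)
    finetuned:   one list of layer weights per task (the teachers)
    task_inputs: one few-shot input batch per task
    """
    student_acts = [x.copy() for x in task_inputs]
    teacher_acts = [x.copy() for x in task_inputs]
    merged, report = [], []
    for layer, w0 in enumerate(pretrained):
        taus = [ft[layer] - w0 for ft in finetuned]  # task vectors
        teacher_out = [x @ ft[layer].T
                       for x, ft in zip(teacher_acts, finetuned)]
        lam = distill_layer(w0, taus, student_acts, teacher_out)
        uniform = np.full(len(taus), 1.0 / len(taus))  # plain averaging
        report.append((
            layer_loss(lam, w0, taus, student_acts, teacher_out),
            layer_loss(uniform, w0, taus, student_acts, teacher_out),
        ))
        w = w0 + sum(c * t for c, t in zip(lam, taus))
        merged.append(w)
        # Progressive step: the next layer trains on these activations.
        student_acts = [np.tanh(x @ w.T) for x in student_acts]
        teacher_acts = [np.tanh(y) for y in teacher_out]
    return merged, report


# Toy setup: two tasks, a two-layer network, eight examples per task.
rng = np.random.default_rng(0)
dims = [6, 5, 4]
pretrained = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(2)]
finetuned = [[w + 0.3 * rng.normal(size=w.shape) for w in pretrained]
             for _ in range(2)]
task_inputs = [rng.normal(size=(8, dims[0])) for _ in range(2)]

merged, report = prodistill_sketch(pretrained, finetuned, task_inputs)
for layer, (fitted, uniform) in enumerate(report):
    print(f"layer {layer}: fitted loss {fitted:.4f}, averaging loss {uniform:.4f}")
```

Only one layer's coefficients are fitted at a time, and only the current activations need to be kept, which is the property that makes a layer-wise scheme attractive for large models. In this sketch the fitted coefficients can never do worse than plain averaging on the per-layer objective, because averaging is one of the candidates the least-squares solve considers.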

Code & Models

Repositories

JingXuTHU/Scalable_Model_Merging_with_Progressive_Layerwise_Distillation
PyTorch · Official

Taxonomy

Topics: Metaheuristic Optimization Algorithms Research