Scalable Model Merging with Progressive Layer-wise Distillation
Jing Xu, Jiazheng Li, Jingzhao Zhang

TL;DR
This paper introduces ProDistill, a novel layer-wise distillation method for scalable model merging that significantly improves performance in few-shot scenarios across vision and NLU tasks, especially for large models.
Contribution
The paper presents ProDistill, a new layer-wise distillation algorithm that enhances model merging scalability and performance, supported by theoretical analysis and extensive experiments.
Findings
ProDistill outperforms existing few-shot merging methods by up to 6.61%.
Layer-wise distillation improves model merging performance.
ProDistill scales effectively to models over 10 billion parameters.
Abstract
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that compared to existing few-shot merging methods,…
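The idea of progressive layer-wise teacher-student distillation can be illustrated with a toy sketch. This is not the authors' implementation: it assumes purely linear layers with a tanh between them, one scalar merging coefficient per task and layer, and a least-squares fit in place of whatever optimizer the paper uses. The names `distill_layer` and `progressive_merge` are hypothetical.

```python
import numpy as np

def distill_layer(base_w, task_vecs, student_inputs, teacher_outputs):
    """Fit one scalar coefficient per task vector for a single linear layer.

    Minimizes sum_j || X_j (W0 + sum_i lam_i T_i)^T - Y_j ||^2 over lam,
    which is an ordinary least-squares problem in lam (toy assumption).
    """
    features, targets = [], []
    for x, y in zip(student_inputs, teacher_outputs):
        # Each column is the flattened contribution of one task vector.
        features.append(np.stack([(x @ t.T).ravel() for t in task_vecs], axis=1))
        targets.append((y - x @ base_w.T).ravel())
    lam, *_ = np.linalg.lstsq(np.vstack(features), np.concatenate(targets), rcond=None)
    merged_w = base_w + sum(l * t for l, t in zip(lam, task_vecs))
    return merged_w, lam

def progressive_merge(base_layers, teacher_layers, few_shot_inputs):
    """Merge layer by layer; teacher_layers[i][l] is layer l of teacher i."""
    student_h = [x.copy() for x in few_shot_inputs]  # activations of the merged model
    teacher_h = [x.copy() for x in few_shot_inputs]  # each teacher on its own few-shot data
    merged, coeffs = [], []
    for l, base_w in enumerate(base_layers):
        task_vecs = [t[l] - base_w for t in teacher_layers]
        teacher_out = [h @ t[l].T for h, t in zip(teacher_h, teacher_layers)]
        w, lam = distill_layer(base_w, task_vecs, student_h, teacher_out)
        merged.append(w)
        coeffs.append(lam)
        # Advance both streams through their own layer before the next step,
        # so only one layer is ever fitted at a time.
        student_h = [np.tanh(h @ w.T) for h in student_h]
        teacher_h = [np.tanh(out) for out in teacher_out]
    return merged, coeffs

# Toy usage: two fine-tuned "teachers" that share a two-layer base model.
rng = np.random.default_rng(0)
base = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
teachers = [[w + 0.1 * rng.normal(size=w.shape) for w in base] for _ in range(2)]
data = [rng.normal(size=(8, 4)) for _ in range(2)]  # 8 few-shot inputs per task
merged, coeffs = progressive_merge(base, teachers, data)
```

Because each layer is fitted in isolation, only that layer's weights and one set of activations per task are needed at a time. In this sketch, that is the property the scalability claim rests on.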
Taxonomy
Topics: Metaheuristic Optimization Algorithms Research
