Model Merging via Multi-Teacher Knowledge Distillation
Seyed Arshan Dalili, Mehrdad Mahdavi

TL;DR
This paper introduces a theoretically grounded approach to model merging using multi-teacher knowledge distillation, leveraging flatness-aware bounds and Sharpness-Aware Minimization to improve generalization in multi-task learning scenarios.
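As an illustration of the multi-teacher distillation idea mentioned above, the following is a minimal sketch, not the paper's objective. It assumes each task has one fine-tuned teacher, that the merged student is scored on the same input as that teacher, and that the per-task loss is a temperature-scaled KL divergence averaged over tasks. The function names (`softmax`, `kl_divergence`, `multi_teacher_kd_loss`) are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions; q is clamped away from zero."""
    return sum(
        pi * math.log(pi / max(qi, 1e-12))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def multi_teacher_kd_loss(
    student_logits_per_task: Sequence[Sequence[float]],
    teacher_logits_per_task: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> float:
    """Average over tasks of KL(teacher_t || student) (assumed loss form)."""
    if len(student_logits_per_task) != len(teacher_logits_per_task):
        raise ValueError("need one student output per teacher")
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for s, t in zip(student_logits_per_task, teacher_logits_per_task)
    ]
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces every teacher's output distribution and grows as the merged model drifts from any one of them.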
Contribution
It establishes a novel PAC-Bayes generalization bound for model merging and operationalizes it through a flatness-aware distillation method called SAMerging.
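For readers unfamiliar with the flatness-aware part, here is a rough sketch of one generic Sharpness-Aware Minimization update. It is not SAMerging itself: the gradient function `grad_fn`, the step size `lr`, and the neighborhood radius `rho` are assumptions for illustration, and in the merging setting the loss would be a distillation objective rather than the toy quadratic used in the usage note.

```python
import math
from typing import Callable, List

Vector = List[float]


def sam_step(
    weights: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.1,
    rho: float = 0.05,
) -> Vector:
    """One SAM update: ascend to a nearby worst-case point, then descend."""
    # 1. Gradient at the current weights.
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))

    # 2. Perturb the weights along the normalized gradient, with radius rho.
    scale = rho / (norm + 1e-12)
    perturbed = [w + scale * g for w, g in zip(weights, grad)]

    # 3. Gradient at the perturbed point approximates the sharpness-aware gradient.
    sam_grad = grad_fn(perturbed)

    # 4. Apply that gradient to the original (unperturbed) weights.
    return [w - lr * g for w, g in zip(weights, sam_grad)]
```

With the toy loss 0.5 * ||w||^2, whose gradient is w itself, `sam_step([3.0, 4.0], lambda w: list(w), lr=0.1, rho=0.5)` moves the weights to approximately `[2.67, 3.56]`.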
Findings
Achieves state-of-the-art results on vision benchmarks
Demonstrates improved robustness and generalization
Provides theoretical insights into model merging dynamics
Abstract
Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing generalization guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, that is, the choice of weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Without a principled objective to guide the selection of these coefficients, such methods yield brittle performance and are highly sensitive to how the scaling coefficients are initialized. We address…
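To make the role of the scaling coefficients concrete, the sketch below assumes a task-arithmetic style merge, in which each fine-tuned model contributes its difference from a shared base model scaled by one coefficient. The abstract does not state the merge formula, so this form, the function name `merge_with_coefficients`, and the dictionary-of-lists parameter layout are all assumptions for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_with_coefficients(
    base: Params,
    finetuned: List[Params],
    coeffs: List[float],
) -> Params:
    """Assumed form: merged = base + sum_i coeffs[i] * (finetuned[i] - base)."""
    if len(finetuned) != len(coeffs):
        raise ValueError("need exactly one coefficient per fine-tuned model")

    merged: Params = {}
    for name, base_vals in base.items():
        out = list(base_vals)
        for model, lam in zip(finetuned, coeffs):
            # Add this model's scaled offset from the base, entry by entry.
            for j, (w, b) in enumerate(zip(model[name], base_vals)):
                out[j] += lam * (w - b)
        merged[name] = out
    return merged


base = {"w": [1.0, 2.0]}
task_a = {"w": [2.0, 2.0]}  # differs from base in the first entry
task_b = {"w": [1.0, 4.0]}  # differs from base in the second entry
merged = merge_with_coefficients(base, [task_a, task_b], [0.5, 0.25])
print(merged)  # {'w': [1.5, 2.5]}
```

Changing a single coefficient rescales that model's whole contribution, which is why a poorly chosen value can degrade every task at once.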
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
