Model Merging via Multi-Teacher Knowledge Distillation

Seyed Arshan Dalili; Mehrdad Mahdavi

arXiv:2512.21288 · cs.LG · December 25, 2025

Open Access

TL;DR

This paper introduces a theoretically grounded approach to model merging using multi-teacher knowledge distillation, leveraging flatness-aware bounds and Sharpness-Aware Minimization to improve generalization in multi-task learning scenarios.

Contribution

It establishes a novel PAC-Bayes generalization bound for model merging and operationalizes it through a flatness-aware distillation method called SAMerging.

Findings

01 Achieves state-of-the-art results on vision benchmarks
02 Demonstrates improved robustness and generalization
03 Provides theoretical insights into model merging dynamics

Abstract

Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical for the scaling coefficients, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Without a principled objective to guide their selection, these methods yield brittle performance and are highly sensitive to how the coefficients are initialized. We address…
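To make the role of the scaling coefficients concrete, here is a minimal sketch of one common merging rule, assuming a task-arithmetic form in which each fine-tuned model contributes its difference from a shared pretrained checkpoint, weighted by its own coefficient. The abstract does not state which rule the paper uses, so the formula, the function name `merge_with_coefficients`, and the toy parameter values are all assumptions.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_with_coefficients(
    pretrained: Params, finetuned: List[Params], coeffs: List[float]
) -> Params:
    """Return pretrained + sum_i coeffs[i] * (finetuned[i] - pretrained).

    Hypothetical helper: each coefficient scales how far the merged
    parameters move toward the corresponding fine-tuned model.
    """
    if len(finetuned) != len(coeffs):
        raise ValueError("need exactly one coefficient per fine-tuned model")
    merged: Params = {}
    for name, base in pretrained.items():
        out = list(base)
        for model, lam in zip(finetuned, coeffs):
            for j, (w, w0) in enumerate(zip(model[name], base)):
                out[j] += lam * (w - w0)
        merged[name] = out
    return merged


# Toy parameters, invented for illustration only.
pretrained = {"layer.weight": [1.0, 2.0]}
task_a = {"layer.weight": [2.0, 2.0]}
task_b = {"layer.weight": [1.0, 4.0]}

merged = merge_with_coefficients(pretrained, [task_a, task_b], [0.5, 0.25])
print(merged)  # {'layer.weight': [1.5, 2.5]}
```

Changing `coeffs` by hand changes the merged parameters directly, which is the sensitivity the abstract attributes to heuristic coefficient selection.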

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications