Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication
Zhe Zhao, Qingyun Liu, Huan Gui, Bang An, Lichan Hong, Ed H. Chi

TL;DR
This paper introduces an interactive communication framework for knowledge distillation that enables more effective transfer of knowledge from large pre-trained models to downstream models, improving performance over traditional methods.
Contribution
The paper proposes an interactive distillation method whose encoder-decoder components tailor knowledge transfer to model capacity and the downstream task distribution.
Findings
Outperforms state-of-the-art distillation techniques on benchmark datasets.
Enables downstream models to learn more effectively from pre-trained models.
Facilitates better adaptation to task-specific distributions.
Abstract
Many recent breakthroughs in machine learning have been enabled by the pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the state-of-the-art in many applications. However, it is still an open question of how to use these models to perform downstream tasks efficiently. Knowledge distillation (KD) has been explored to tackle this challenge. KD transfers knowledge from a large teacher model to a smaller student model. While KD has been successful in improving student model performance, recent research has discovered that a powerful teacher does not necessarily lead to a powerful student, due to their huge capacity gap. In addition, the potential distribution shifts between the pre-training data and downstream tasks can make knowledge transfer in KD sub-optimal for improving downstream task…
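For readers unfamiliar with the baseline the abstract refers to, here is a minimal sketch of the conventional soft-label KD objective, in which a student is trained to match a teacher's temperature-softened output distribution alongside the usual label loss. This is the standard formulation, not the interactive method proposed in the paper. The function names `softmax` and `kd_loss` and the `temperature` and `alpha` defaults are illustrative assumptions, not values from the source.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature scaling (numerically stabilised)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Conventional KD loss: weighted sum of a soft-target term and a hard-label term.

    student_logits, teacher_logits: arrays of shape (batch, num_classes).
    labels: integer class ids of shape (batch,).
    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    eps = 1e-12  # avoids log(0)
    labels = np.asarray(labels)

    # Soft-target term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student_soft + eps)),
        axis=-1,
    ).mean()

    # Hard-label term: cross-entropy of the student against ground-truth labels.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(len(labels)), labels] + eps).mean()

    # The temperature**2 factor keeps the soft-term gradient scale comparable
    # across temperatures, as is customary for this loss.
    return alpha * temperature**2 * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    teacher = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
    student = np.array([[2.0, 1.5, 0.5], [1.0, 1.2, 0.8]])
    print(kd_loss(student, teacher, labels=[0, 1]))
```

In this formulation the teacher's outputs are fixed targets and are the same for every student, which is the setting in which the capacity gap and distribution shift described in the abstract arise.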
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Knowledge Distillation
