Weight Averaging Improves Knowledge Distillation under Domain Shift
Valeriy Berezovskiy, Nikita Morozov

TL;DR
This paper introduces Weight-Averaged Knowledge Distillation (WAKD), a simple method that averages student weights during knowledge distillation to improve student performance under domain shift.
Contribution
The paper connects knowledge distillation and domain generalization by applying weight averaging methods to student training, including a new simple strategy that does not require evaluation on validation data, to improve performance under domain shift.
Findings
Weight averaging techniques from the domain generalization literature, such as SWAD and SMA, improve KD under domain shift.
The proposed simple averaging method performs comparably to SWAD and SMA.
WAKD enhances generalization of student networks in unseen domains.
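To make the idea behind these findings concrete, here is a minimal sketch of uniform weight averaging over student checkpoints. It is not the paper's implementation: the names RunningWeightAverage and average_from, the representation of weights as plain dictionaries of floats, and the choice to average every snapshot from a fixed start step onward are all assumptions made for illustration. SWAD and SMA select or weight checkpoints differently.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


class RunningWeightAverage:
    """Uniform running mean of parameter snapshots (illustrative helper)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean: Weights = {}

    def update(self, weights: Weights) -> None:
        """Fold one snapshot into the running mean."""
        self.count += 1
        if self.count == 1:
            # First snapshot: copy it so later updates do not alias the input.
            self.mean = {name: list(values) for name, values in weights.items()}
            return
        for name, values in weights.items():
            avg = self.mean[name]
            for i, w in enumerate(values):
                # Incremental mean: avg_n = avg_{n-1} + (w - avg_{n-1}) / n
                avg[i] += (w - avg[i]) / self.count


def average_from(snapshots: List[Weights], start: int) -> Weights:
    """Average all snapshots taken at step >= start (assumed schedule)."""
    averager = RunningWeightAverage()
    for step, weights in enumerate(snapshots):
        if step >= start:
            averager.update(weights)
    return averager.mean


if __name__ == "__main__":
    # Toy student checkpoints; the first one is skipped as "warm-up".
    checkpoints = [
        {"w": [0.0, 10.0]},
        {"w": [1.0, 2.0]},
        {"w": [3.0, 4.0]},
    ]
    print(average_from(checkpoints, start=1))
```

In this sketch the averaged weights are what would be evaluated on unseen domains, while the training run itself continues from the unaveraged student weights.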
Abstract
Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It is focused on training a small student network to mimic a larger teacher network. While it is widely known that KD can improve student generalization in the i.i.d. setting, its performance under domain shift, i.e., the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper we take a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in the domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simplistic weight averaging strategy that does not require evaluation on validation data…
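As a rough illustration of what "training a student to mimic a teacher" usually means, the sketch below computes a common distillation objective: cross-entropy on the true label plus a temperature-softened KL divergence between teacher and student predictions. This is the widely used formulation, not necessarily the exact loss in the paper; the function names and the default values of temperature and alpha are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    temperature: float = 2.0,  # assumed value, for illustration only
    alpha: float = 0.5,        # assumed mixing weight, for illustration only
) -> float:
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL divergence between softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(q_teacher, q_student)
        if t > 0.0
    )

    # The T^2 factor keeps the soft-term gradient scale comparable across T.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 0.8, -0.2]
    print(distillation_loss(student, teacher, label=0))
```

The sketch assumes moderate logit values; with extreme logits a softened student probability can underflow to zero, which a production implementation would guard against by working in log space.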
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Seismic Imaging and Inversion Techniques · Multimodal Machine Learning Applications
Methods: Knowledge Distillation · Data-efficient Image Transformer · Stochastic Weight Averaging
