TL;DR
This paper introduces a novel channel-wise knowledge distillation method for semantic segmentation that aligns feature channels between teacher and student networks using KL divergence, improving performance and efficiency.
Contribution
It proposes a new channel-wise distillation approach that focuses on soft distribution alignment of feature channels, outperforming existing spatial methods in semantic segmentation.
Findings
Outperforms existing spatial distillation methods in semantic segmentation.
Requires less computational cost during training than spatial distillation methods.
Achieves superior performance on multiple benchmarks.
Abstract
Knowledge distillation (KD) has been proven to be a simple and effective tool for training compact models. Almost all KD variants for dense prediction tasks align the student and teacher networks' feature maps in the spatial domain, typically by minimizing point-wise and/or pair-wise discrepancy. Observing that in semantic segmentation, some layers' feature activations of each channel tend to encode saliency of scene categories (analogous to class activation mapping), we propose to align features channel-wise between the student and teacher networks. To this end, we first transform the feature map of each channel into a probability map using softmax normalization, and then minimize the Kullback-Leibler (KL) divergence of the corresponding channels of the two networks. By doing so, our method focuses on mimicking the soft distributions of channels between networks. In particular, the KL…
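The two steps described in the abstract (a per-channel softmax over spatial locations, then a KL divergence between corresponding channels) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The function names, the temperature parameter `tau`, the KL direction (teacher relative to student), and the averaging over channels are all assumptions that the abstract does not specify. The sketch also assumes the teacher and student feature maps already share the same shape (C, H, W).

```python
import numpy as np


def channel_softmax(feat, tau=1.0):
    """Turn each channel of a (C, H, W) feature map into a probability
    map over its H*W spatial locations. `tau` is an assumed temperature."""
    c = feat.shape[0]
    flat = feat.reshape(c, -1) / tau
    # Subtract the per-channel maximum for numerical stability.
    flat = flat - flat.max(axis=1, keepdims=True)
    exp = np.exp(flat)
    return exp / exp.sum(axis=1, keepdims=True)


def channel_wise_kl(teacher_feat, student_feat, tau=1.0, eps=1e-12):
    """Mean over channels of KL(teacher || student), computed between the
    softmax-normalized maps of corresponding channels.
    The KL direction and the channel averaging are assumptions."""
    p_t = channel_softmax(teacher_feat, tau)
    p_s = channel_softmax(student_feat, tau)
    kl_per_channel = np.sum(
        p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=1
    )
    return float(kl_per_channel.mean())


# Usage with random stand-in features: 4 channels on an 8x8 grid.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))
student = rng.normal(size=(4, 8, 8))
loss = channel_wise_kl(teacher, student)
```

Because every channel is normalized over its own spatial locations, the loss compares where each channel places its mass and ignores the absolute scale of the activations. In a training loop this term would be added to the usual segmentation loss with a weighting factor, which is likewise not given in the text above.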
Taxonomy
Methods: Softmax
