Expandable Residual Approximation for Knowledge Distillation
Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye

TL;DR
This paper introduces Expandable Residual Approximation (ERA), a knowledge distillation method inspired by the Stone-Weierstrass theorem. ERA decomposes the approximation of residual knowledge into multiple steps so that the student can mimic the teacher's representation more easily, improving performance across vision tasks.
Contribution
ERA combines residual knowledge decomposition, implemented with a Multi-Branched Residual Network (MBRNet), and a Teacher Weight Integration (TWI) strategy to address the capacity gap between teacher and student in knowledge distillation.
Findings
Improves ImageNet Top-1 accuracy by 1.41%.
Enhances MS COCO AP by 1.40.
Achieves state-of-the-art results across vision benchmarks.
Abstract
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate…
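The divide-and-conquer idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's MBRNet; it is a minimal stand-in that assumes each "branch" is a single polynomial basis function with one least-squares coefficient, fitted only to the residual left by the previous branches. All names (`fit_branch`, `progressive_residual_fit`, the `exp(x)` "teacher") are hypothetical and chosen for illustration.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_branch(residual, basis):
    """Least-squares scalar so that coef * basis best matches the residual."""
    denom = sum(b * b for b in basis)
    if denom == 0.0:
        return 0.0
    return sum(r * b for r, b in zip(residual, basis)) / denom


def progressive_residual_fit(teacher, student, bases):
    """Fit branches one at a time; each sees only what earlier ones left over."""
    approx = list(student)
    errors = [l2(teacher, approx)]
    for basis in bases:
        residual = [t - a for t, a in zip(teacher, approx)]
        coef = fit_branch(residual, basis)
        approx = [a + coef * b for a, b in zip(approx, basis)]
        errors.append(l2(teacher, approx))
    return approx, errors


# Toy setup (assumption): exp(x) plays the teacher, a constant plays the student.
xs = [i / 10 for i in range(11)]
teacher = [math.exp(x) for x in xs]
student = [1.0 for _ in xs]
bases = [[x ** k for x in xs] for k in range(1, 5)]  # four polynomial branches

approx, errors = progressive_residual_fit(teacher, student, bases)
for step, err in enumerate(errors):
    print(f"after {step} branches: error = {err:.6f}")
```

Because each coefficient is a least-squares projection onto the current residual, the error cannot increase from one branch to the next. In this toy, that is the sense in which splitting the approximation into steps makes each step easier.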
