Cross-Architecture Distillation Made Simple with Redundancy Suppression
Weijia Zhang, Yuehao Liu, Wu Ran, Chao Ma

TL;DR
This paper introduces a simple, efficient method for cross-architecture knowledge distillation that works by suppressing redundant, architecture-exclusive information. It outperforms more complex existing methods such as OFA while using fewer parameters and remaining broadly applicable.
Contribution
The authors propose a novel redundancy suppression distillation (RSD) loss that simplifies cross-architecture knowledge transfer without architecture-specific modules.
Findings
Outperforms OFA on CIFAR-100 and ImageNet-1k benchmarks
Uses fewer parameters than existing methods
Provides a simple, effective baseline for cross-architecture distillation
Abstract
We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximisation and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student's internal representations. Our method is devoid of the…
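The abstract names the two parts of the RSD loss (cross-architecture invariance maximisation and feature decorrelation) but does not give the formula. The sketch below shows one common way such a pair of objectives is written: a cross-correlation loss between student and teacher features, in the style of Barlow Twins. This is an assumption for illustration, not the authors' confirmed formulation. The function names, the `decorrelation_weight` value, and the requirement that both feature sets already share the same dimension (for example, after the lightweight decoupling module) are all hypothetical. Plain Python is used so the example runs without dependencies.

```python
import math


def standardize_columns(features):
    """Return per-dimension columns, each zero-mean and unit-variance over the batch."""
    n = len(features)
    dim = len(features[0])
    columns = []
    for j in range(dim):
        col = [row[j] for row in features]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        columns.append([(x - mean) / std for x in col])
    return columns


def cross_correlation(student, teacher):
    """Cross-correlation matrix C[i][j] between student dim i and teacher dim j."""
    n = len(student)
    s_cols = standardize_columns(student)
    t_cols = standardize_columns(teacher)
    return [
        [sum(a * b for a, b in zip(s_col, t_col)) / n for t_col in t_cols]
        for s_col in s_cols
    ]


def redundancy_suppression_loss(student, teacher, decorrelation_weight=0.005):
    """Illustrative loss (assumed form, not taken from the paper).

    invariance term:    pull diagonal entries of C towards 1
    decorrelation term: push off-diagonal entries of C towards 0
    """
    c = cross_correlation(student, teacher)
    dim = len(c)
    invariance = sum((1.0 - c[i][i]) ** 2 for i in range(dim))
    decorrelation = sum(
        c[i][j] ** 2 for i in range(dim) for j in range(dim) if i != j
    )
    return invariance + decorrelation_weight * decorrelation


# Usage: a batch of 4 samples with 2-dimensional features (toy values).
teacher_feats = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
student_feats = [[0.9, 1.1], [1.2, -0.8], [-1.0, 0.9], [-1.1, -1.2]]
loss = redundancy_suppression_loss(student_feats, teacher_feats)
print(f"loss = {loss:.4f}")
```

Under this assumed form, the diagonal term rewards the student for agreeing with the teacher dimension by dimension, while the off-diagonal term penalises dimensions that duplicate each other's information. That is one reading of "redundancy suppression"; the paper itself should be consulted for the exact objective.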
