SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis
Huiyuan Tian, Bonan Xu, Shijian Li, Gang Pan

TL;DR
SpectralKD introduces a spectral analysis framework for understanding and improving knowledge distillation in Vision Transformers, achieving state-of-the-art results without additional trainable parameters.
Contribution
The paper presents a unified spectral analysis framework for Vision Transformers (ViTs) and knowledge distillation (KD) that reveals layer importance and shared spectral patterns, and it proposes a simple spectral alignment method for effective KD.
Findings
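Model-wise analysis shows that CaiT concentrates information in its first and last few layers, which informs layer selection for KD.
Layer-wise analysis shows that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences.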
The spectral alignment method improves top-1 accuracy on ImageNet-1K.
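As a rough illustration of the kind of analysis behind these findings, the sketch below scores each layer by the mean magnitude of the 2D Fourier transform of its feature map and ranks layers by that score. The summary above does not define the paper's actual intensity measure, so the function names `spectral_intensity` and `rank_layers`, the (batch, channels, height, width) layout, and the choice of mean FFT magnitude are all assumptions made for illustration.

```python
import numpy as np


def spectral_intensity(features: np.ndarray) -> float:
    """Mean FFT magnitude of one feature map.

    Assumes `features` has shape (batch, channels, height, width);
    this layout and this score are illustrative, not the paper's definition.
    """
    # 2D FFT over the two spatial axes only.
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    return float(np.abs(spectrum).mean())


def rank_layers(layer_features):
    """Return (layer indices sorted by descending intensity, per-layer scores)."""
    scores = [spectral_intensity(f) for f in layer_features]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for feature maps captured from three layers.
    layers = [rng.normal(scale=s, size=(2, 8, 14, 14)) for s in (1.0, 0.2, 0.8)]
    order, scores = rank_layers(layers)
    print("layers by spectral intensity:", order)
```

Under this reading, layers with the highest scores would be the candidates for distillation, which matches the idea of using model-wise analysis to guide layer selection.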
Abstract
Knowledge Distillation (KD) has achieved widespread success in compressing large Vision Transformers (ViTs), but a unified theoretical framework for both ViTs and KD is still lacking. In this paper, we propose SpectralKD, a novel unified analytical framework that offers deeper insights into ViTs and optimizes KD via spectral analysis. Our model-wise analysis reveals that CaiT concentrates information in its first and last few layers, informing optimal layer selection for KD. Surprisingly, our layer-wise analysis discovers that Swin Transformer and CaiT exhibit similar spectral encoding patterns despite their architectural differences, leading to a feature map alignment guideline. Building on these insights, we propose a simple yet effective spectral alignment method for KD. Benefiting from the deeper understanding provided by the above analysis, even such a simple strategy achieves…
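The abstract does not spell out the alignment objective, so the following is only a minimal sketch of one plausible form of spectral alignment: transform student and teacher feature maps with a 2D FFT and penalize the mean squared difference of their real and imaginary parts. The name `spectral_alignment_loss`, the requirement that both feature maps already share a shape, and the use of MSE are assumptions, not the authors' implementation.

```python
import numpy as np


def spectral_alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """MSE between the 2D spectra of two feature maps (illustrative only).

    Assumes both inputs have the same shape (batch, channels, height, width).
    How mismatched shapes are handled in the paper is not covered here.
    """
    if student.shape != teacher.shape:
        raise ValueError("student and teacher feature maps must share a shape")
    s = np.fft.fft2(student, axes=(-2, -1))
    t = np.fft.fft2(teacher, axes=(-2, -1))
    # Compare real and imaginary parts separately so the loss stays real-valued.
    s_parts = np.stack([s.real, s.imag], axis=-1)
    t_parts = np.stack([t.real, t.imag], axis=-1)
    return float(np.mean((s_parts - t_parts) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(2, 8, 14, 14))
    student = teacher + 0.1 * rng.normal(size=teacher.shape)
    print("loss:", spectral_alignment_loss(student, teacher))
```

A loss of this shape uses a fixed transform and adds no learnable weights, which is consistent with the claim that the method needs no additional trainable parameters.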
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · CCD and CMOS Imaging Sensors · Neural Networks and Applications
Methods: Attention Is All You Need · Stochastic Depth · Byte Pair Encoding · Class Attention · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Dense Connections · Residual Connection
