From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
Huiyuan Tian, Bonan Xu, Shijian Li

TL;DR
This paper investigates why feature-map knowledge distillation fails in Vision Transformers and proposes simple fixes that significantly improve compressed ViT performance on ImageNet.
Contribution
It uncovers the encoding mismatch phenomenon in ViT distillation and introduces two minimal remedies, Lift and WideLast, to address this issue.
Findings
Lift and WideLast improve ViT distillation accuracy on ImageNet.
Encoding mismatch explains the failure of feature-map KD in ViTs.
The proposed fixes also improve students trained without distillation.
Abstract
Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher's features span a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they lie in a low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or…
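The contrast between the per-image and dataset-level views can be illustrated on synthetic data. The sketch below is not the authors' code or experimental setup: the feature shapes, the random per-image subspaces, the 99% energy threshold, and the helper name rank_for_energy are all assumptions chosen for illustration. It builds token features in which every "image" occupies its own random rank-8 subspace, then compares the rank found by sample-wise SVD with the rank found by PCA over all tokens pooled together.

```python
import numpy as np


def rank_for_energy(singular_values, energy=0.99):
    """Smallest number of modes whose squared singular values reach `energy` of the total.

    Hypothetical helper; the 0.99 threshold is an illustrative assumption.
    """
    power = singular_values ** 2
    cumulative = np.cumsum(power) / power.sum()
    return int(np.searchsorted(cumulative, energy) + 1)


rng = np.random.default_rng(0)
# Assumed toy sizes: 64 images, 196 tokens each, 384 channels, rank 8 per image.
num_images, num_tokens, width, per_image_rank = 64, 196, 384, 8

# Synthetic stand-in for teacher features: each image draws its own random
# low-rank subspace, so the subspace "rotates" from one input to the next.
features = []
for _ in range(num_images):
    basis, _ = np.linalg.qr(rng.standard_normal((width, per_image_rank)))
    coeffs = rng.standard_normal((num_tokens, per_image_rank))
    features.append(coeffs @ basis.T)  # shape: (num_tokens, width)
features = np.stack(features)  # shape: (num_images, num_tokens, width)

# Sample-wise view: SVD of each image's token matrix separately.
per_image_ranks = [
    rank_for_energy(np.linalg.svd(x, compute_uv=False)) for x in features
]

# Dataset-level view: PCA over all tokens pooled across images
# (singular values of the mean-centered stacked matrix).
stacked = features.reshape(-1, width)
stacked = stacked - stacked.mean(axis=0)
dataset_rank = rank_for_energy(np.linalg.svd(stacked, compute_uv=False))

print("max per-image rank:", max(per_image_ranks))
print("dataset-level rank:", dataset_rank)
```

In this toy setting each image stays within its 8 modes, while the pooled data needs far more, because a single fixed linear projector has to cover every per-image subspace at once. Real ViT features will behave less cleanly than this construction.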
