The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
Seungwoo Son, Jegwang Ryu, Namhoon Lee, Jaeho Lee

TL;DR
This paper proposes a masking-based framework that reduces the cost of obtaining teacher supervision in vision transformer distillation, saving up to 50% of teacher FLOPs without sacrificing student accuracy.
Contribution
It introduces a simple masking method that selectively skips teacher computations during distillation, improving efficiency while maintaining performance.
Findings
Masking the patches with the lowest student attention scores saves up to 50% of teacher FLOPs.
Student-guided masking acts as an effective curriculum.
The proposed masking strategy causes no drop in student accuracy.
Abstract
Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criteria lead to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making…
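The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `keep_indices` and `mask_teacher_input`, the `mask_ratio` parameter, and the assumption that one student attention score per patch is already available are all hypothetical choices made for illustration. How those scores are extracted from the student is not specified here and is left outside the sketch.

```python
from typing import List, Sequence, Tuple


def keep_indices(student_attention: Sequence[float], mask_ratio: float) -> List[int]:
    """Return indices of the patches to keep for the teacher.

    Hypothetical helper: drops the `mask_ratio` fraction of patches with the
    lowest student attention scores and keeps the rest.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_patches = len(student_attention)
    num_keep = num_patches - int(num_patches * mask_ratio)
    # Rank patches by student attention, highest first; ties broken by index.
    ranked = sorted(range(num_patches), key=lambda i: (-student_attention[i], i))
    # Restore the original patch order for the patches that survive.
    return sorted(ranked[:num_keep])


def mask_teacher_input(
    patch_tokens: Sequence, student_attention: Sequence[float], mask_ratio: float
) -> Tuple[list, List[int]]:
    """Return the reduced token list passed to the teacher, plus kept indices."""
    kept = keep_indices(student_attention, mask_ratio)
    return [patch_tokens[i] for i in kept], kept


# Toy example: six patches with made-up attention scores (illustrative only).
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
teacher_tokens, kept = mask_teacher_input(tokens, scores, mask_ratio=0.5)
```

Because the teacher receives only the kept tokens, its parameters and architecture stay untouched; the saving comes solely from processing a shorter token sequence.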
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
