The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

Seungwoo Son; Jegwang Ryu; Namhoon Lee; Jaeho Lee

arXiv:2302.10494 · cs.LG · September 30, 2024 · 5 cites

TL;DR

This paper proposes a masking-based framework that reduces the computational cost of obtaining teacher supervision in vision transformer distillation, saving up to 50% of teacher FLOPs without sacrificing student accuracy.

Contribution

It introduces a simple method that masks a fraction of the teacher's input tokens, skipping the associated teacher computations during distillation without changing the teacher's parameters or architecture.

Findings

01

Masking the patches with the lowest student attention scores saves up to 50% of teacher FLOPs.

02

Student-guided masking acts as an effective curriculum.

03

Student accuracy does not drop under the proposed masking strategy, whereas other masking criteria yield smaller efficiency gains.
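
As a rough illustration of why dropping teacher input tokens reduces teacher compute, the sketch below estimates the multiply-accumulate count of one transformer block as a function of sequence length. The cost model (4nd² for the attention projections, 8nd² for an MLP with 4x expansion, 2n²d for the attention itself) is a common approximation rather than anything taken from the paper, and the function names, dimensions, and keep ratio are illustrative assumptions. It ignores the class token, patch embedding, and classification head.

```python
def vit_block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block (assumed cost model)."""
    # Q, K, V and output projections: 4 * n * d^2
    proj = 4 * num_tokens * dim * dim
    # Two MLP layers with hidden width mlp_ratio * d: 2 * mlp_ratio * n * d^2
    mlp = 2 * mlp_ratio * num_tokens * dim * dim
    # Attention scores (Q K^T) and weighted sum (A V): 2 * n^2 * d
    attn = 2 * num_tokens * num_tokens * dim
    return proj + mlp + attn


def teacher_saving(num_tokens, dim, keep_ratio):
    """Fraction of per-block compute saved when only keep_ratio of the tokens is kept."""
    kept = round(keep_ratio * num_tokens)
    full = vit_block_macs(num_tokens, dim)
    masked = vit_block_macs(kept, dim)
    return 1.0 - masked / full


# Illustrative values only: 196 patches, width 768, half of the tokens kept.
print(f"estimated per-block saving: {teacher_saving(196, 768, 0.5):.3f}")
```

Under this model the dominant terms are linear in the number of tokens, so the per-block saving tracks the fraction of tokens removed, with a small extra gain from the quadratic attention term.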

Abstract

Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criteria lead to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making…
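
The selection step described in the abstract, keeping the patches with the highest student attention and dropping the rest before the teacher forward pass, can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the `keep_ratio` parameter, and the assumption that a single attention score per patch is already available from the student are all hypothetical.

```python
import math


def select_teacher_tokens(student_attention, keep_ratio):
    """Return sorted indices of the patches to pass to the teacher.

    student_attention: one attention score per patch, taken from the student
        (hypothetical input; how the scores are extracted is an assumption).
    keep_ratio: fraction of patches kept for the teacher, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_patches = len(student_attention)
    num_keep = max(1, math.ceil(keep_ratio * num_patches))
    # Rank patches by descending student attention; ties are broken by index.
    ranked = sorted(range(num_patches), key=lambda i: (-student_attention[i], i))
    # Drop the lowest-scoring patches and restore spatial order for the rest.
    return sorted(ranked[:num_keep])


def gather_tokens(tokens, keep_indices):
    """Pick the kept patch tokens; the teacher then runs on this shorter sequence."""
    return [tokens[i] for i in keep_indices]


scores = [0.05, 0.30, 0.10, 0.25, 0.02, 0.28]
kept = select_teacher_tokens(scores, keep_ratio=0.5)
print(kept)  # [1, 3, 5]
```

Because the masked tokens are simply removed from the teacher's input sequence, the teacher's weights and architecture stay untouched; only the sequence length changes.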

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning