Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation
Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, Yunhe Wang

TL;DR
This paper introduces a fine-grained manifold distillation method for vision transformers. The method compresses models while retaining high accuracy on ImageNet and improving transfer learning performance, which makes the resulting models better suited to edge devices.
Contribution
It proposes a novel patch-level manifold distillation technique specifically designed for vision transformers, reducing computational costs while maintaining high accuracy.
Findings
The DeiT-Tiny model achieves 76.5% top-1 accuracy on ImageNet-1k.
The method outperforms previous distillation approaches by 2.0%.
The method also demonstrates superior transfer learning results on various benchmarks.
Abstract
In the past few years, transformers have achieved promising performance on various computer vision tasks. Unfortunately, the immense inference overhead of most existing vision transformers prevents them from being deployed on edge devices such as cell phones and smart watches. Knowledge distillation is a widely used paradigm for compressing cumbersome architectures by transferring information to a compact student. However, most existing distillation methods are designed for convolutional neural networks (CNNs) and do not fully investigate the characteristics of vision transformers (ViTs). In this paper, we utilize patch-level information and propose a fine-grained manifold distillation method. Specifically, we train a tiny student model to match a pre-trained teacher model in the patch-level manifold space. Then, we decouple the manifold matching loss into three terms with careful design to further reduce…
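To make the idea of matching a teacher "in the patch-level manifold space" more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the manifold is represented by the pairwise cosine-similarity matrix of all patch embeddings in a batch, and that the student is penalized by the mean squared difference between its matrix and the teacher's. The function names `patch_manifold` and `manifold_matching_loss` are hypothetical, and the three-term decoupling mentioned in the abstract is not reproduced here because the abstract is truncated before describing it.

```python
import numpy as np


def patch_manifold(features):
    """Pairwise cosine similarities between all patch embeddings in a batch.

    features: array of shape (batch, patches, dim).
    Returns an array of shape (batch * patches, batch * patches).
    """
    flat = features.reshape(-1, features.shape[-1])      # (B*N, D)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)  # (B*N, 1)
    unit = flat / np.clip(norms, 1e-12, None)            # avoid division by zero
    return unit @ unit.T


def manifold_matching_loss(student_feats, teacher_feats):
    """Mean squared gap between student and teacher relation matrices.

    Assumption: a plain mean-squared penalty; the paper's exact loss may differ.
    """
    diff = patch_manifold(student_feats) - patch_manifold(teacher_feats)
    return float(np.mean(diff ** 2))


# Toy data: 2 images, 4 patches each. Embedding widths differ on purpose.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 4, 8))
student = rng.normal(size=(2, 4, 3))
loss = manifold_matching_loss(student, teacher)
```

Because the relation matrices depend only on the number of patches and not on the embedding width, this kind of loss can compare a narrow student with a wide teacher without an extra projection layer. The matrix grows quadratically with batch size times patch count, which is presumably the cost that the paper's decoupling is meant to reduce.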
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Machine Learning and ELM
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · Dense Connections · Multi-Head Attention · Vision Transformer · Knowledge Distillation
