DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets
Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

TL;DR
DeiT-LT introduces a novel distillation approach for training Vision Transformers on long-tailed datasets, improving tail class recognition by leveraging CNN-based distillation tokens and re-weighted loss functions.
Contribution
This work presents DeiT-LT, a new distillation method that enhances ViT training on imbalanced datasets by focusing on tail classes and mitigating overfitting through CNN-based distillation tokens.
Findings
Improved tail class accuracy on long-tailed datasets.
Effective training of ViT from scratch on diverse datasets.
Distillation tokens specialize in head and tail class features.
Abstract
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (e.g., DeiT) have been proposed to train ViT effectively on balanced datasets. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from a CNN via the distillation (DIST) token by using out-of-distribution images and re-weighting the…
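The two-token idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is cut off before it says what is re-weighted, so the inverse-frequency class weights, the hard teacher labels, the equal 0.5/0.5 mix of the two terms, and every function name below are assumptions chosen only for illustration. The classification (CLS) output is scored against the ground-truth label, while the DIST output is scored against the CNN teacher's prediction with a weight that grows for rare classes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_frequency_weights(class_counts):
    """Hypothetical re-weighting: weight proportional to 1/count, mean 1."""
    inv = [1.0 / c for c in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

def cross_entropy(logits, target, weight=1.0):
    """Weighted negative log-likelihood of the target class."""
    return -weight * math.log(softmax(logits)[target])

def two_token_loss(cls_logits, dist_logits, label, teacher_logits, class_weights):
    """CLS head learns from the label, DIST head from the teacher (assumed mix)."""
    # Hard distillation target: the teacher's most confident class (assumption).
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    # Rare (tail) teacher predictions get a larger weight on the DIST term.
    loss_dist = cross_entropy(dist_logits, teacher_label, class_weights[teacher_label])
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy long-tailed setup: class 0 is the head, class 2 is the tail.
weights = inverse_frequency_weights([100, 10, 1])
uniform = [0.0, 0.0, 0.0]
loss_tail = two_token_loss(uniform, uniform, 0, [0.0, 0.0, 5.0], weights)
loss_head = two_token_loss(uniform, uniform, 0, [5.0, 0.0, 0.0], weights)
```

With these assumed weights, an error on a sample the teacher assigns to the tail class contributes more to the DIST term than the same error on a head-class sample, which is the general effect a re-weighted distillation loss aims for.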
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections
