DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Harsh Rangwani; Pradipto Mondal; Mayank Mishra; Ashish Ramayee Asokan; R. Venkatesh Babu

arXiv:2404.02900 · cs.CV · April 4, 2024 · 1 cites

TL;DR

DeiT-LT introduces a novel distillation approach for training Vision Transformers on long-tailed datasets, improving tail class recognition by leveraging CNN-based distillation tokens and re-weighted loss functions.

Contribution

This work presents DeiT-LT, a new distillation method that enhances ViT training on imbalanced datasets by focusing on tail classes and mitigating overfitting through CNN-based distillation tokens.

Findings

01. Improved tail class accuracy on long-tailed datasets.

02. Effective training of ViT from scratch on diverse datasets.

03. Distillation tokens specialize in head and tail class features.

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from CNN via distillation DIST token by using out-of-distribution images and re-weighting the…
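The patch-token step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `patchify`, the nested-list image layout (height x width x channels), and the toy 4x4 image are all assumptions made for illustration. A real ViT would additionally apply a learned linear projection to each flattened patch and add position embeddings, both omitted here.

```python
from typing import List

# Assumed layout for illustration: image[row][col] is a list of channel values.
Image = List[List[List[float]]]


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    tokens = []
    # Walk the grid of patches in row-major order.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then pixels in the row, then channels.
            token = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Toy 4x4 three-channel image whose pixel values encode their position.
image = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))
```

With a 4x4 image and 2x2 patches, this yields four tokens of length 2 * 2 * 3 = 12 each. The stack of self-attention blocks then operates on this token sequence.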

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections