Vision Transformer for Small-Size Datasets

Seung Hoon Lee; Seunghyun Lee; Byung Cheol Song

arXiv:2112.13492·cs.CV·December 28, 2021·122 cites

TL;DR

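This paper introduces Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), add-on modules that strengthen the locality inductive bias of Vision Transformers so they can be trained from scratch on small datasets, improving performance by an average of 2.96% on Tiny-ImageNet.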

Contribution

The paper proposes SPT and LSA, generic add-on modules that let ViTs learn effectively from scratch on small datasets. Strong ViT performance previously depended on pre-training with a large dataset such as JFT-300M.

Findings

01 Performance of ViTs improved by an average of 2.96% on Tiny-ImageNet when both SPT and LSA were applied

02 Swin Transformer performance increased by 4.08%

03 SPT and LSA are easily applicable to various ViTs

Abstract

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a large-size dataset such as JFT-300M, and its dependence on a large dataset is interpreted as due to low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively solve the lack of locality inductive bias and enable it to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, the performance improved by an average of 2.96% in Tiny-ImageNet, which is a representative small-size dataset. Especially, Swin…
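The abstract names SPT and LSA without describing their mechanics, so the following is a minimal sketch only, not the authors' implementation. It assumes that SPT concatenates the image with four diagonally shifted copies (shifted by half a patch, zero-padded) before splitting into patches, and that LSA masks each token's attention to itself and divides the scores by a temperature that would be learnable during training. The function names, the zero padding, and the use of NumPy are illustrative choices; the layer normalization and linear projection that would normally follow tokenization are omitted.

```python
import numpy as np


def shifted_patch_tokens(image, patch):
    """Hypothetical SPT sketch: image (H, W, C) -> (num_patches, patch*patch*5*C)."""
    h, w, c = image.shape
    s = patch // 2  # assumed shift: half the patch size
    # Zero-pad so shifted views can be cropped back to H x W (an assumption).
    padded = np.pad(image, ((s, s), (s, s), (0, 0)))
    # Four diagonal shifts, each cropped to the original spatial size.
    offsets = [(0, 0), (0, 2 * s), (2 * s, 0), (2 * s, 2 * s)]
    views = [image] + [padded[dy:dy + h, dx:dx + w] for dy, dx in offsets]
    stacked = np.concatenate(views, axis=-1)  # (H, W, 5C)
    gh, gw = h // patch, w // patch
    # Split into non-overlapping patches and flatten each patch into a token.
    x = stacked[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 5 * c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * 5 * c)


def locality_self_attention(q, k, v, temperature):
    """Hypothetical LSA sketch for one head: q, k, v are (n, d) with n >= 2."""
    scores = (q @ k.T) / temperature  # temperature would be a learned scalar
    np.fill_diagonal(scores, -np.inf)  # diagonal masking: no self-attention
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In this sketch an 8x8 RGB image with a patch size of 4 yields 4 tokens of length 4 * 4 * 15 = 240, and every row of the attention matrix sums to 1 with a zero on the diagonal.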

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Stochastic Depth · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Swin Transformer · Dense Connections