Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song

TL;DR
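This paper introduces Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), two add-on modules that strengthen the locality inductive bias of Vision Transformers (ViTs) so that they can be trained from scratch on small datasets, with an average gain of 2.96% on Tiny-ImageNet.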
Contribution
The paper proposes SPT and LSA, generic add-on modules that let ViTs learn effectively from scratch on small datasets. Previously, the high performance of ViTs depended on pre-training with a large dataset such as JFT-300M.
Findings
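Applying both SPT and LSA to ViTs improved performance by an average of 2.96% on Tiny-ImageNet
Swin Transformer performance increased by 4.08% with SPT and LSA
SPT and LSA are generic add-on modules that are easily applicable to various ViTs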
Abstract
Recently, the Vision Transformer (ViT), which applies the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training on a large dataset such as JFT-300M, and its dependence on a large dataset is attributed to its low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable the ViT to learn from scratch even on small datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, performance improved by an average of 2.96% on Tiny-ImageNet, a representative small dataset. Especially, Swin…
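The abstract names SPT and LSA without describing how they work. The sketch below illustrates both ideas as they are described in the full paper, not in this summary: SPT concatenates the image with four diagonally shifted copies (shifted by half a patch) before splitting it into patches, and LSA masks the diagonal of the attention scores and divides by a temperature that is learned in the paper. This is not the authors' implementation. The function names, the NumPy setting, the zero-padding of shifted borders, and the fixed (non-learned) temperature are all assumptions made for illustration, and the layer normalization and linear projection that follow patch flattening in the paper are omitted.

```python
import numpy as np


def shift_image(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels.

    Assumption: vacated border pixels are filled with zeros.
    """
    h, w, _ = img.shape
    out = np.zeros_like(img)
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


def shifted_patch_tokens(img, patch):
    """Hypothetical SPT front end: returns flattened patches of shape
    (num_patches, patch * patch * 5 * C). H and W must be multiples of patch."""
    s = patch // 2
    # Four diagonal shifts by half a patch: up-left, up-right, down-left, down-right.
    shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]
    stacked = np.concatenate(
        [img] + [shift_image(img, dy, dx) for dy, dx in shifts], axis=-1
    )
    h, w, c = stacked.shape
    # Split into non-overlapping patches and flatten each one.
    grid = stacked.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def locality_self_attention(q, k, v, temperature):
    """Hypothetical single-head LSA on (N, d) arrays, N >= 2.

    temperature is a plain float here; in the paper it is a learned parameter.
    """
    n = q.shape[0]
    scores = (q @ k.T) / temperature
    # Diagonal masking: a token may not attend to itself.
    scores[np.eye(n, dtype=bool)] = -np.inf
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

For an 8x8 RGB image and a patch size of 4, `shifted_patch_tokens` returns 4 tokens of length 4 * 4 * 15 = 240, five times the length of a standard ViT patch. Each token therefore carries pixels from neighboring patches, which is how the tokenization widens the receptive field. In `locality_self_attention`, the masked diagonal forces each token's attention onto the other tokens, and a temperature below the usual square root of the key dimension sharpens the resulting distribution.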
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Stochastic Depth · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Swin Transformer · Dense Connections
