MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets

Bowei Zhang; Yi Zhang

arXiv:2501.06040·cs.CV·January 15, 2025

TL;DR

MSCViT is a compact Vision Transformer architecture with multi-scale self-attention and convolutional features, designed specifically for small datasets, achieving high accuracy without large-scale pre-training.

Contribution

Introduces MSCViT, a parameter-efficient ViT variant that combines multi-scale attention with wavelet convolution. It is optimized for tiny datasets and reduces computational cost relative to the original ViT.

Findings

01. Achieves 84.68% accuracy on CIFAR-100 without pre-training.

02. Uses wavelet convolution for local feature extraction.

03. Reduces model parameters and FLOPs compared to the original ViT.

Abstract

Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability to model long-range dependencies. However, this success is largely fueled by training on massive numbers of samples. In real applications, large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) when it is trained only on a small-scale dataset (a so-called tiny dataset), since it requires a large amount of training data to ensure its representational capacity. In this paper, a small-size ViT architecture with a multi-scale self-attention mechanism and convolution blocks (dubbed MSCViT) is presented to model different scales of attention at each layer. First, we introduce wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features.…
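To make the "frequency division" step more concrete, here is a minimal sketch of a single-level 2D Haar wavelet transform, which splits an image into one low-frequency band and three high-frequency detail bands. The abstract does not say which wavelet MSCViT uses or how the high-frequency components are combined with the convolution channel, so the choice of Haar, the function name `haar_dwt2`, and the band names are assumptions made purely for illustration. This is not the authors' implementation.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Illustrative only: the Haar wavelet is an assumption, not taken from the paper.
    Returns four half-resolution sub-bands:
    (low, detail_rows, detail_cols, detail_diag).
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    low, detail_rows, detail_cols, detail_diag = [], [], [], []
    for i in range(0, height, 2):
        low_r, rows_r, cols_r, diag_r = [], [], [], []
        for j in range(0, width, 2):
            # One 2x2 block: a b / c d
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            low_r.append((a + b + c + d) / 2.0)   # local average (low frequency)
            rows_r.append((a + b - c - d) / 2.0)  # change between the two rows
            cols_r.append((a - b + c - d) / 2.0)  # change between the two columns
            diag_r.append((a - b - c + d) / 2.0)  # diagonal change
        low.append(low_r)
        detail_rows.append(rows_r)
        detail_cols.append(cols_r)
        detail_diag.append(diag_r)
    return low, detail_rows, detail_cols, detail_diag


flat = [[3.0, 3.0], [3.0, 3.0]]   # no variation: all detail bands are zero
edge = [[1.0, 5.0], [1.0, 5.0]]   # jump between columns: only detail_cols responds
print(haar_dwt2(flat))  # ([[6.0]], [[0.0]], [[0.0]], [[0.0]])
print(haar_dwt2(edge))  # ([[6.0]], [[0.0]], [[-4.0]], [[0.0]])
```

The three detail bands respond only to local changes such as edges, which is why high-frequency components are a natural input for local feature extraction. How MSCViT selects among them is described in the full paper, not in the truncated abstract above.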

Taxonomy

Topics: Video Analysis and Summarization

Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer