MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets
Bowei Zhang, Yi Zhang

TL;DR
MSCViT is a compact Vision Transformer architecture with multi-scale self-attention and convolutional features, designed specifically for small datasets, achieving high accuracy without large-scale pre-training.
Contribution
Introduces MSCViT, a parameter-efficient ViT variant that combines multi-scale attention with wavelet convolution, is optimized for tiny datasets, and reduces computational cost.
Findings
Achieves 84.68% accuracy on CIFAR-100 without pre-training.
Uses wavelet convolution for local feature extraction.
Reduces model parameters and FLOPs compared to the original ViT.
Abstract
Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability to model long-range dependencies. However, this success is largely fueled by training on massive datasets. In real applications, large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) when it is trained only on a small-scale dataset (called a tiny dataset), since it requires a large amount of training data to realize its representational capacity. In this paper, a small-size ViT architecture with a multi-scale self-attention mechanism and convolution blocks (dubbed MSCViT) is presented to model different scales of attention at each layer. First, we introduce wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features.…
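The "frequency division" mentioned in the abstract can be illustrated with a single-level 2D wavelet transform. The following is a minimal sketch only, assuming a Haar wavelet and NumPy: the excerpt does not say which wavelet MSCViT uses or how the high-frequency components are merged with the convolution channel, so the function names `haar_dwt2` and `high_frequency_stack` are hypothetical and not taken from the paper.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    Illustrative assumption: the paper's actual wavelet is not given
    in this excerpt. H and W must be even. Returns four subbands
    (LL, LH, HL, HH), each of shape (H/2, W/2).
    """
    if x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("height and width must be even")
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    # Subband naming conventions vary between libraries; here:
    lh = (a + b - c - d) / 2.0  # top row minus bottom row of each block
    hl = (a - b + c - d) / 2.0  # left column minus right column
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def high_frequency_stack(x):
    """Stack the three high-frequency subbands as channels: (3, H/2, W/2).

    Hypothetical helper: one plausible way to expose the high-frequency
    components to a later convolution, not the authors' method.
    """
    _, lh, hl, hh = haar_dwt2(x)
    return np.stack([lh, hl, hh], axis=0)
```

For a constant image every high-frequency subband is zero, and because this form of the transform is orthonormal, the total energy of the four subbands equals that of the input. In a model along the lines the abstract describes, a stack like this would presumably be fed to or merged with a convolution branch, but that step is not specified here.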
Taxonomy
Topics: Video Analysis and Summarization
Methods: Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer
