Powerful Design of Small Vision Transformer on CIFAR10
Gent Wu

TL;DR
This paper investigates how to optimize small Vision Transformers for CIFAR-10, demonstrating that techniques like low-rank compression and multiple CLS tokens enhance performance with minimal complexity increase.
Contribution
It introduces a systematic framework for designing Tiny ViTs on small datasets, highlighting the effectiveness of low-rank compression of queries and of multiple CLS tokens.
Findings
Low-rank compression of queries causes minimal performance loss.
Multiple CLS tokens improve global representation and accuracy.
Optimized Tiny ViTs outperform baseline models on CIFAR-10.
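As a rough illustration of the first finding, the sketch below factors the query projection of one self-attention layer into a down-projection and an up-projection, which is the general idea behind low-rank query compression. It is not the paper's implementation: the function name, the dimensions (192-dimensional tokens, rank 48, 3 heads, 65 tokens), and the random weights are all assumptions chosen for illustration, and the key/value compression used in full Multi-Head Latent Attention is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_query_attention(x, w_q_down, w_q_up, w_k, w_v, num_heads):
    """Self-attention where queries pass through a rank-r bottleneck.

    x: (tokens, dim). w_q_down: (dim, rank). w_q_up: (rank, dim).
    w_k, w_v: (dim, dim). Hypothetical helper, not from the paper.
    """
    n, d = x.shape
    head_dim = d // num_heads
    q = (x @ w_q_down) @ w_q_up  # compress to rank r, then expand back to dim
    k = x @ w_k
    v = x @ w_v

    def split_heads(t):
        # (tokens, dim) -> (heads, tokens, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, tokens, tokens)
    out = softmax(scores, axis=-1) @ v                     # (heads, tokens, head_dim)
    return out.transpose(1, 0, 2).reshape(n, d)            # merge heads

# Assumed sizes: 64 patch tokens plus 1 CLS token, embedding width 192.
rng = np.random.default_rng(0)
dim, rank, heads, tokens = 192, 48, 3, 65
x = rng.normal(size=(tokens, dim))
w_q_down = rng.normal(scale=dim ** -0.5, size=(dim, rank))
w_q_up = rng.normal(scale=rank ** -0.5, size=(rank, dim))
w_k = rng.normal(scale=dim ** -0.5, size=(dim, dim))
w_v = rng.normal(scale=dim ** -0.5, size=(dim, dim))

out = low_rank_query_attention(x, w_q_down, w_q_up, w_k, w_v, heads)

# Query-projection parameter counts: full matrix vs. the two low-rank factors.
full_params = dim * dim
low_rank_params = 2 * dim * rank
```

With these assumed sizes the factored query projection stores half as many weights as a full `dim × dim` matrix, while the output keeps the same shape as standard attention.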
Abstract
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at…
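The multiple-CLS-token idea from the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helper names, the choice of 4 CLS tokens, the sizes, and in particular the mean pooling over CLS outputs are hypothetical, since the abstract does not say how the tokens are aggregated. The transformer encoder itself is replaced by an identity stand-in.

```python
import numpy as np

def prepend_cls_tokens(patch_tokens, cls_tokens):
    """Prepend k learnable CLS tokens to every sequence in the batch.

    patch_tokens: (batch, num_patches, dim). cls_tokens: (k, dim).
    Returns (batch, k + num_patches, dim).
    """
    batch = patch_tokens.shape[0]
    cls = np.broadcast_to(cls_tokens, (batch,) + cls_tokens.shape)
    return np.concatenate([cls, patch_tokens], axis=1)

def classify_from_cls(encoded, num_cls, w_head, b_head):
    """Pool the k CLS outputs and apply a linear classification head.

    Mean pooling is an assumption made for this sketch.
    """
    pooled = encoded[:, :num_cls, :].mean(axis=1)  # (batch, dim)
    return pooled @ w_head + b_head                # (batch, num_classes)

# Assumed sizes: 64 patches per image, width 192, 4 CLS tokens, 10 classes.
rng = np.random.default_rng(0)
batch, num_patches, dim, num_cls, num_classes = 2, 64, 192, 4, 10
patches = rng.normal(size=(batch, num_patches, dim))
cls_tokens = rng.normal(scale=0.02, size=(num_cls, dim))
w_head = rng.normal(scale=dim ** -0.5, size=(dim, num_classes))
b_head = np.zeros(num_classes)

tokens = prepend_cls_tokens(patches, cls_tokens)  # (2, 68, 192)
encoded = tokens  # identity stand-in for the transformer encoder
logits = classify_from_cls(encoded, num_cls, w_head, b_head)
```

The extra CLS tokens lengthen the sequence by only `num_cls - 1` positions relative to a single-CLS model, which is consistent with the summary's claim of a minimal complexity increase.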
