How Lightweight Can A Vision Transformer Be
Jen Hong Tan

TL;DR
This paper investigates how to make vision transformers more lightweight using a Mixture-of-Experts approach, achieving competitive performance at very small model sizes without complex mechanisms.
Contribution
It introduces a novel MoE-based design for vision transformers that reduces size and complexity while maintaining competitive accuracy, especially at very small scales.
Findings
Achieves competitive performance with only 0.67M parameters.
Effective transfer learning at small model sizes.
Simplifies the architecture by avoiding complex attention and convolutional mechanisms.
Abstract
In this paper, we explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers. Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer. No complex attention or convolutional mechanisms are employed. Depth-wise scaling is applied to progressively reduce the size of the hidden layer and the number of experts is increased in stages. Grouped query attention is used. We studied the proposed approach with and without pre-training on small datasets and investigated whether transfer learning works at this scale. We found that the architecture is competitive even at a size of 0.67M parameters.
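The weight sharing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the author's implementation. It assumes the common SwiGLU form FFN(x) = (Swish(xW) * xV) W2, so that sharing V and W2 leaves each expert with only its own W. The top-1 softmax router, the class name SharedSwiGLUMoE, the helper unshared_param_count, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def swish(z):
    # Swish / SiLU activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class SharedSwiGLUMoE:
    """MoE layer whose SwiGLU experts share V and W2 (illustrative sketch)."""

    def __init__(self, d_model, d_hidden, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.02
        # One gate projection W per expert: the only expert-specific weights.
        self.W = rng.normal(0.0, s, (n_experts, d_model, d_hidden))
        # V and W2 are shared by every expert in the layer.
        self.V = rng.normal(0.0, s, (d_model, d_hidden))
        self.W2 = rng.normal(0.0, s, (d_hidden, d_model))
        # Router weights (assumption: top-1 softmax gating).
        self.Wg = rng.normal(0.0, s, (d_model, n_experts))

    def __call__(self, x):
        # x has shape (tokens, d_model)
        gate = softmax(x @ self.Wg)              # (tokens, n_experts)
        choice = gate.argmax(axis=-1)            # chosen expert per token
        xv = x @ self.V                          # shared branch, computed once
        # Apply each token's own expert matrix W[choice[t]].
        xw = np.einsum("td,tdh->th", x, self.W[choice])
        hidden = swish(xw) * xv                  # SwiGLU gating
        weight = gate[np.arange(len(x)), choice][:, None]
        return weight * (hidden @ self.W2)       # shared output projection

    def n_params(self):
        return self.W.size + self.V.size + self.W2.size + self.Wg.size


def unshared_param_count(d_model, d_hidden, n_experts):
    # Baseline for comparison: every expert owns W, V and W2, plus the router.
    return n_experts * 3 * d_model * d_hidden + d_model * n_experts


layer = SharedSwiGLUMoE(d_model=8, d_hidden=16, n_experts=4)
x = np.random.default_rng(1).normal(size=(5, 8))
y = layer(x)
print(y.shape, layer.n_params(), unshared_param_count(8, 16, 4))
```

With these toy dimensions the shared layer holds 800 weights against 1568 for the unshared baseline. Under the stated assumptions, the expert weights grow as (E + 2) * d_model * d_hidden instead of 3 * E * d_model * d_hidden for E experts, which is one way an MoE layer can shrink a model rather than enlarge it.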
Taxonomy
Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors
Methods: Softmax · Attention Is All You Need · Mixture of Experts · SwiGLU
