How Lightweight Can A Vision Transformer Be

Jen Hong Tan

arXiv:2407.17783·cs.CV·July 26, 2024

TL;DR

This paper investigates how to make vision transformers more lightweight using a Mixture-of-Experts approach, achieving competitive performance at very small model sizes without complex mechanisms.

Contribution

It introduces a novel MoE-based design for vision transformers that reduces size and complexity while maintaining competitive accuracy, especially at very small scales.

Findings

01. Achieves competitive performance with only 0.67M parameters.

02. Transfer learning remains effective at small model sizes.

03. Simplifies the architecture by avoiding complex attention and convolutional mechanisms.

Abstract

In this paper, we explore a strategy that uses Mixture-of-Experts (MoE) to streamline, rather than augment, vision transformers. Each expert in an MoE layer is a SwiGLU feedforward network, where V and W2 are shared across the layer. No complex attention or convolutional mechanisms are employed. Depth-wise scaling is applied to progressively reduce the size of the hidden layer and the number of experts is increased in stages. Grouped query attention is used. We studied the proposed approach with and without pre-training on small datasets and investigated whether transfer learning works at this scale. We found that the architecture is competitive even at a size of 0.67M parameters.
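To make the weight-sharing idea in the abstract concrete, here is a minimal sketch of an MoE layer built from SwiGLU experts. It assumes the common SwiGLU notation FFN(x) = (Swish(xW) * xV) W2, so that each expert keeps its own W while V and W2 are shared across the layer. The routing rule (top-1 selection over a softmax, with the output scaled by the router probability), the class name SharedSwiGLUMoE, and all sizes are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def matvec(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def softmax(zs):
    """Numerically stable softmax over a list of floats."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; the initialization scheme is arbitrary here."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SharedSwiGLUMoE:
    """Hypothetical MoE layer: per-expert W, with V and W2 shared by all experts."""

    def __init__(self, d_model, d_hidden, n_experts, seed=0):
        rng = random.Random(seed)
        self.router = rand_matrix(rng, d_model, n_experts)
        # One gate projection W per expert.
        self.expert_w = [rand_matrix(rng, d_model, d_hidden) for _ in range(n_experts)]
        # V and W2 exist once per layer and are reused by every expert.
        self.v = rand_matrix(rng, d_model, d_hidden)
        self.w2 = rand_matrix(rng, d_hidden, d_model)

    def num_params(self):
        """Count the weights held by the layer (no biases in this sketch)."""
        matrices = [self.router, self.v, self.w2] + self.expert_w
        return sum(len(m) * len(m[0]) for m in matrices)

    def forward(self, x):
        """Route one token vector to a single expert (assumed top-1 routing)."""
        probs = softmax(matvec(x, self.router))
        k = max(range(len(probs)), key=probs.__getitem__)
        gate = [swish(z) for z in matvec(x, self.expert_w[k])]  # expert-specific
        value = matvec(x, self.v)                               # shared V
        hidden = [g * v for g, v in zip(gate, value)]
        out = matvec(hidden, self.w2)                           # shared W2
        # Scaling by the router probability is an assumption of this sketch.
        return [probs[k] * o for o in out], k


layer = SharedSwiGLUMoE(d_model=8, d_hidden=16, n_experts=4)
output, chosen_expert = layer.forward([0.5] * 8)
print(len(output), chosen_expert, layer.num_params())
```

With these toy sizes the layer holds 800 weights (32 for the router, 4 x 128 for the expert gates, and 128 each for V and W2). Giving every expert its own V and W2 would instead cost 32 + 3 x 512 = 1568 weights, which illustrates how sharing lets the expert count grow without a matching growth in parameters.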

Taxonomy

Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors

Methods: Softmax · Attention Is All You Need · Mixture of Experts · SwiGLU