MicroViT: A Vision Transformer with Low Complexity Self Attention for Edge Device

Novendra Setyawan; Chi-Chia Sun; Mao-Hsiu Hsu; Wen-Kai Kuo; Jun-Wei Hsieh

arXiv:2502.05800 · cs.CV · July 3, 2025

Open Access

TL;DR

MicroViT is a lightweight Vision Transformer designed for edge devices, utilizing an efficient attention mechanism to significantly reduce computational complexity while maintaining high accuracy, enabling practical deployment on resource-constrained hardware.

Contribution

The paper introduces MicroViT, a novel low-complexity Vision Transformer architecture with an efficient attention mechanism tailored for edge device applications.

Findings

01. Achieves competitive accuracy on ImageNet-1K and COCO datasets.

02. Increases inference speed by 3.6 times over previous models.

03. Reduces energy consumption by 40%, enhancing efficiency for edge deployment.

Abstract

The Vision Transformer (ViT) has demonstrated state-of-the-art performance in various computer vision tasks, but its high computational demands make it impractical for edge devices with limited resources. This paper presents MicroViT, a lightweight Vision Transformer architecture optimized for edge devices by significantly reducing computational complexity while maintaining high accuracy. The core of MicroViT is the Efficient Single Head Attention (ESHA) mechanism, which utilizes group convolution to reduce feature redundancy and processes only a fraction of the channels, thus lowering the burden of the self-attention mechanism. MicroViT is designed using a multi-stage MetaFormer architecture, stacking multiple MicroViT encoders to enhance efficiency and performance. Comprehensive experiments on the ImageNet-1K and COCO datasets demonstrate that MicroViT achieves competitive accuracy…
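To make the "fraction of the channels" idea concrete, here is a rough cost sketch. It is not the paper's own accounting: it uses a generic multiply-accumulate (MAC) count for single-head self-attention and ignores the group convolution and the channels that bypass attention. The function names, the 14x14 token grid, the 256 channels, and the 0.25 fraction are all illustrative assumptions.

```python
def attention_macs(num_tokens: int, channels: int) -> int:
    """Approximate MACs for single-head self-attention over `channels`.

    Generic textbook estimate (an assumption, not taken from the paper):
    four linear projections (Q, K, V, output) cost 4 * N * C^2, and the
    score matrix plus the attention-weighted sum cost 2 * N^2 * C.
    """
    projections = 4 * num_tokens * channels * channels
    token_mixing = 2 * num_tokens * num_tokens * channels
    return projections + token_mixing


def partial_channel_macs(num_tokens: int, channels: int, fraction: float) -> int:
    """Same estimate when attention only sees `fraction` of the channels."""
    attended = int(channels * fraction)  # hypothetical split ratio
    return attention_macs(num_tokens, attended)


if __name__ == "__main__":
    tokens, channels = 14 * 14, 256  # assumed feature-map size and width
    full = attention_macs(tokens, channels)
    partial = partial_channel_macs(tokens, channels, 0.25)
    print(f"full attention:    {full:,} MACs")
    print(f"partial attention: {partial:,} MACs")
    print(f"reduction factor:  {full / partial:.2f}x")
```

Under these assumptions the projection term shrinks quadratically with the channel fraction and the token-mixing term shrinks linearly, which is why attending to a subset of channels lowers the attention cost by more than the fraction alone suggests.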

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies · Advanced Memory and Neural Computing