TL;DR
MicroViTv2 is a lightweight, energy-efficient Vision Transformer optimized for edge devices, achieving higher accuracy and efficiency through reparameterization and novel attention mechanisms.
Contribution
The paper introduces MicroViTv2, a reparameterized Vision Transformer with new modules for faster inference and improved accuracy on edge hardware.
Findings
MicroViTv2 surpasses MobileViTv2, EdgeNeXt, and EfficientViT in accuracy.
It maintains fast inference and high energy efficiency on Jetson AGX Orin.
Structural re-parameterization improves performance in ways that FLOP counts alone do not capture.
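To make the re-parameterization finding concrete, here is a minimal NumPy sketch of the general idea: a multi-branch block used at training time is folded into one depthwise convolution for inference. The branch layout (3x3 depthwise + 1x1 depthwise + identity, no normalization) and all function names are assumptions chosen for illustration. They are not taken from the paper's RepEmbed or RepDW modules.

```python
import numpy as np


def depthwise_conv3x3(x, kernel, bias):
    """Depthwise 3x3 convolution, stride 1, zero padding 1.

    x: (C, H, W), kernel: (C, 3, 3), bias: (C,)
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Each channel is filtered only by its own 3x3 kernel.
            out += kernel[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out + bias[:, None, None]


def multi_branch_forward(x, k3, b3, k1, b1):
    """Hypothetical training-time block: three parallel branches, summed."""
    out_3x3 = depthwise_conv3x3(x, k3, b3)
    out_1x1 = k1[:, None, None] * x + b1[:, None, None]  # depthwise 1x1 conv
    return out_3x3 + out_1x1 + x                          # identity shortcut


def fuse_branches(k3, b3, k1, b1):
    """Fold the 1x1 and identity branches into the 3x3 kernel.

    A 1x1 kernel is a 3x3 kernel that is zero everywhere except the centre,
    and the identity is a 3x3 kernel with a 1 at the centre.
    """
    fused_kernel = k3.copy()
    fused_kernel[:, 1, 1] += k1 + 1.0
    fused_bias = b3 + b1
    return fused_kernel, fused_bias


rng = np.random.default_rng(0)
channels = 4
x = rng.standard_normal((channels, 8, 8))
k3 = rng.standard_normal((channels, 3, 3))
b3 = rng.standard_normal(channels)
k1 = rng.standard_normal(channels)
b1 = rng.standard_normal(channels)

reference = multi_branch_forward(x, k3, b3, k1, b1)
fused_kernel, fused_bias = fuse_branches(k3, b3, k1, b1)
fused = depthwise_conv3x3(x, fused_kernel, fused_bias)
max_abs_diff = np.abs(reference - fused).max()
```

The fused convolution computes the same function as the three-branch block, yet it is a single operator with a single pass over memory. That is one plausible reason a re-parameterized model can run faster on a device than an operation count would suggest.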
Abstract
The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts a reparameterized design, specifically a Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware…
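The abstract does not spell out how SDTA works. As background only, the sketch below shows generic transposed (channel-wise) attention of the kind used in earlier work such as EdgeNeXt, where the attention map is built over channels rather than tokens. The function name, the L2 normalization, and the `temperature` parameter are assumptions for illustration. This is not the paper's SDTA module.

```python
import numpy as np


def transposed_attention(q, k, v, temperature=1.0):
    """Generic channel-wise attention (illustrative, not the paper's SDTA).

    q, k, v: (N, C) arrays with N tokens and C channels.
    Returns the (N, C) output and the (C, C) attention weights.
    """
    # Normalize each channel across tokens so the scores stay bounded.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)

    # The attention map is C x C, so its size does not grow with N.
    scores = (q.T @ k) * temperature
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output channel is a weighted mix of the value channels.
    out = v @ weights.T
    return out, weights


rng = np.random.default_rng(0)
num_tokens, num_channels = 196, 16
q = rng.standard_normal((num_tokens, num_channels))
k = rng.standard_normal((num_tokens, num_channels))
v = rng.standard_normal((num_tokens, num_channels))
out, weights = transposed_attention(q, k, v)
```

With 196 tokens and 16 channels, the attention map here is 16 x 16 instead of 196 x 196. That is the usual argument for why attention of this kind suits high-resolution inputs on constrained hardware.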
