EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

TL;DR
EfficientViM is a vision architecture built on hidden state mixer-based state space duality (HSM-SSD). It captures global dependencies at reduced computational cost and reports a better speed-accuracy trade-off on ImageNet-1k than prior models, with further gains when scaling images or using distillation.
Contribution
The paper proposes EfficientViM, a vision model that lowers computational cost by redesigning the SSD layer so that channel mixing runs on compressed hidden states (the HSM-SSD layer), and that adds multi-stage hidden state fusion to improve performance.
Findings
Achieves up to 0.7% higher accuracy than SHViT while running faster.
Shows significant throughput and accuracy improvements over prior work when scaling images or using distillation.
Sets a new state-of-the-art speed-accuracy trade-off on ImageNet-1k.
Abstract
For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective operation for global interaction with its favorable linear computational cost in the number of tokens. To harness the efficacy of SSM, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. With the observation that the runtime of the SSD layer is driven by the linear projections on the input sequences, we redesign the original SSD layer to perform the channel mixing operation within compressed hidden states in the HSM-SSD layer. Additionally,…
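The redesign described in the abstract, moving channel mixing from the input sequence onto compressed hidden states, can be illustrated with a small sketch. This is a simplified, non-causal toy in plain Python and not the authors' implementation: the function names, the single weight matrix `W` standing in for the mixer, and the sizes used in the cost comparison are all assumptions made for illustration.

```python
def matmul(p, q):
    """Plain-Python matrix product of p (m x k) and q (k x n)."""
    return [[sum(pi * qj for pi, qj in zip(row, col)) for col in zip(*q)]
            for row in p]


def hidden_state(x, a, B):
    """Compress L tokens (L x D) into N hidden states (N x D): h = (a * B)^T x.

    a is a per-token scalar (length L) and B is an L x N matrix; both are
    hypothetical stand-ins for the state space parameters.
    """
    num_tokens, num_states = len(a), len(B[0])
    scaled_B_T = [[a[t] * B[t][n] for t in range(num_tokens)]
                  for n in range(num_states)]  # shape (N, L)
    return matmul(scaled_B_T, x)


def mix_on_tokens(x, a, B, C, W):
    """Baseline ordering: expand to L tokens first, then mix channels."""
    h = hidden_state(x, a, B)   # (N, D)
    y = matmul(C, h)            # (L, D)
    return matmul(y, W)         # channel mixing costs about L * D * D


def mix_on_hidden(x, a, B, C, W):
    """Hidden-state ordering: mix channels on N states, then expand."""
    h = hidden_state(x, a, B)   # (N, D)
    h = matmul(h, W)            # channel mixing costs about N * D * D
    return matmul(C, h)         # (L, D)


def mixing_macs(rows, channels):
    """Multiply-accumulates for a (rows x channels) @ (channels x channels) product."""
    return rows * channels * channels
```

With a purely linear mixer, as here, both orderings return the same output by associativity, so the sketch only shows where the saving comes from: the mixing cost scales with the number of hidden states N instead of the number of tokens L. For assumed sizes L = 196, N = 16 and D = 64, `mixing_macs(196, 64) / mixing_macs(16, 64)` equals L / N. A mixer with gating or a nonlinearity would no longer be equivalent to mixing on the tokens.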
Taxonomy
Topics: Machine Learning and ELM · CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications
Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Depthwise Convolution · 1x1 Convolution · Convolution · SSD
