EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Sanghyeok Lee; Joonmyung Choi; Hyunwoo J. Kim

arXiv:2411.15241·cs.CV·March 25, 2025·6 cites

TL;DR

EfficientViM is a vision architecture built on hidden state mixer-based state space duality (HSM-SSD). It captures global dependencies at reduced computational cost and reports a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, with further gains when scaling image size or using distillation.

Contribution

The paper proposes EfficientViM, a vision model that redesigns the SSD layer to perform channel mixing within compressed hidden states (the HSM-SSD layer) and introduces multi-stage hidden state fusion, reducing computational cost while improving performance.

Findings

01 Achieves up to 0.7% higher accuracy than SHViT while running faster.

02 Shows significant throughput and accuracy improvements when scaling images or using distillation.

03 Sets a new state-of-the-art speed-accuracy trade-off on ImageNet-1k.

Abstract

For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective operation for global interaction with its favorable linear computational cost in the number of tokens. To harness the efficacy of SSM, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. With the observation that the runtime of the SSD layer is driven by the linear projections on the input sequences, we redesign the original SSD layer to perform the channel mixing operation within compressed hidden states in the HSM-SSD layer. Additionally,…
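The reordering described in the abstract, moving channel mixing from the token sequence to the compressed hidden states, can be sketched in NumPy. This is a minimal illustration under assumptions that are not in the source: a single-head, non-causal formulation; hypothetical names `x`, `a`, `B`, `C`, `W` and function names; and a plain linear mixer `W` with no gating or nonlinearity. The function `mix_after_readout` is a simplified stand-in for per-token projections, not the original SSD layer. Under these assumptions the two orderings give the same output by associativity, while the mixing step runs over N hidden states instead of L tokens.

```python
import numpy as np

def mix_after_readout(x, a, B, C, W):
    """Stand-in baseline: read the shared state out to L tokens, then mix channels."""
    h = (a[:, None] * B).T @ x   # (N, D): hidden states aggregated over all L tokens
    y = C @ h                    # (L, D): per-token readout
    return y @ W                 # channel mixing over L rows, about L*D*D multiplies

def mix_in_hidden_state(x, a, B, C, W):
    """Reordered: mix channels on the N hidden states, then read out."""
    h = (a[:, None] * B).T @ x   # (N, D)
    h = h @ W                    # channel mixing over N rows, about N*D*D multiplies
    return C @ h                 # (L, D)

def mixing_multiplies(rows, dim):
    """Multiply count of a (rows, dim) @ (dim, dim) matrix product."""
    return rows * dim * dim

rng = np.random.default_rng(0)
L, N, D = 196, 16, 64            # tokens, hidden states, channels (illustrative sizes)
x = rng.standard_normal((L, D))  # input tokens
a = rng.uniform(0.5, 1.0, size=L)  # per-token scalar weights (assumed)
B = rng.standard_normal((L, N))  # token-to-state projection
C = rng.standard_normal((L, N))  # state-to-token readout
W = rng.standard_normal((D, D))  # linear channel mixer (assumed, no nonlinearity)

y_tokens = mix_after_readout(x, a, B, C, W)
y_states = mix_in_hidden_state(x, a, B, C, W)
print(np.allclose(y_tokens, y_states))
print(mixing_multiplies(L, D) / mixing_multiplies(N, D))
```

With these illustrative sizes the mixing step shrinks by a factor of L/N = 12.25. If the mixer includes a nonlinearity or gating, the two orderings are no longer numerically identical, so the equality check here holds only for the linear case.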

Code & Models

Models

birder-project/efficientvim_m1_il-common (Hugging Face model, 12 downloads)

Taxonomy

Topics: Machine Learning and ELM · CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications

Methods: Softmax · Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Depthwise Convolution · 1x1 Convolution · Convolution · SSD