Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
Saarthak Kapse, Robin Betz, Srinivasan Sivanandan

TL;DR
Fast Vision Mamba introduces a novel pooling strategy in state space models that significantly accelerates vision processing, achieving up to 72.5% speedup while maintaining high accuracy across multiple tasks.
Contribution
The paper proposes Fast Vision Mamba, which reduces recurrent steps in SSM-based vision models through token pooling, leading to faster inference without performance loss.
Findings
Achieves up to 72.5% inference speedup on high-resolution images.
Maintains state-of-the-art performance across various vision tasks.
Reduces the number of parallel steps in the SSM block by 2× through token pooling.
Abstract
State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from L sequential steps to log(L) parallel steps with respect to the number of input tokens (L). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2× reduction in the number of parallel steps in the SSM block. Our model offers up to 72.5% speedup in inference speed compared…
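The pooling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes mean pooling as the pooling operator, a square H × W token grid, and a parallel scan that needs roughly log2(n) steps for n tokens. The names `pool_tokens` and `parallel_scan_steps` are hypothetical helpers introduced only for this example.

```python
import math
import numpy as np

def pool_tokens(x, block_index):
    """Mean-pool an (H, W, D) token grid along one image dimension.

    Assumption: even-numbered blocks pool across the width and odd-numbered
    blocks pool across the height, so the pooled dimension alternates.
    """
    axis = 1 if block_index % 2 == 0 else 0
    return x.mean(axis=axis)

def parallel_scan_steps(num_tokens):
    """Approximate parallel scan depth, assumed to be ceil(log2(n))."""
    return math.ceil(math.log2(num_tokens))

H = W = 32   # token grid size (illustrative)
D = 8        # embedding dimension (illustrative)
x = np.random.default_rng(0).standard_normal((H, W, D))

pooled_even = pool_tokens(x, block_index=0)  # shape (H, D): one token per row
pooled_odd = pool_tokens(x, block_index=1)   # shape (W, D): one token per column

full_steps = parallel_scan_steps(H * W)                    # scan over all H*W tokens
pooled_steps = parallel_scan_steps(pooled_even.shape[0])   # scan over H pooled tokens
print(full_steps, pooled_steps)
```

Under these assumptions, scanning H pooled tokens instead of all H × W tokens halves the scan depth on a square grid, since log2(H) is half of log2(H × H). This is one way to read the 2× reduction in parallel steps stated in the abstract.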
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Robotics and Sensor-Based Localization · Digital Image Processing Techniques
