ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA
Shengzhe Lyu, Yuhan She, Patrick S. Y. Hung, Ray C. C. Cheung, Weitao Xu

TL;DR
ViM-Q introduces a hardware-aware quantization and FPGA acceleration approach for efficient edge deployment of Vision Mamba models, achieving significant speed and energy efficiency improvements.
Contribution
The paper presents a novel co-design of quantization schemes and FPGA hardware tailored for ViM model inference, enabling scalable and efficient edge deployment.
Findings
4.96x speedup over GPU baseline
59.8x energy efficiency gain
Effective mitigation of activation outliers and low-bit weight quantization
Abstract
Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
