ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA

Shengzhe Lyu; Yuhan She; Patrick S. Y. Hung; Ray C. C. Cheung; Weitao Xu

arXiv:2605.01935 · cs.AR · May 5, 2026

TL;DR

ViM-Q combines hardware-aware quantization with an FPGA accelerator for edge deployment of Vision Mamba (ViM) models, reporting a 4.96x speedup over a GPU baseline and a 59.8x energy efficiency gain.

Contribution

The paper presents a novel co-design of quantization schemes and FPGA hardware tailored for ViM model inference, enabling scalable and efficient edge deployment.

Findings

01 4.96x speedup over GPU baseline

02 59.8x energy efficiency gain

03 Mitigates activation outliers and enables low-bit (4-bit) weight quantization

Abstract

Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight…
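
The activation side of the quantization scheme described above (per-channel smoothing followed by dynamic per-token quantization) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so this sketch assumes a SmoothQuant-style smoothing factor and symmetric 8-bit per-token scaling. The function names, the `alpha` parameter, and the bit width are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # x: [tokens][in_channels], w: [in_channels][out_channels]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, used only to check that smoothing preserves x @ w."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth_scales(x: Matrix, w: Matrix, alpha: float = 0.5) -> List[float]:
    """Assumed SmoothQuant-style factor per input channel j:
    s_j = max|x[:, j]|^alpha / max|w[j, :]|^(1 - alpha)."""
    scales = []
    for j in range(len(w)):
        x_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        if x_max == 0.0 or w_max == 0.0:
            scales.append(1.0)  # nothing to rebalance for an all-zero channel
        else:
            scales.append((x_max ** alpha) / (w_max ** (1.0 - alpha)))
    return scales


def apply_smoothing(x: Matrix, w: Matrix, s: List[float]) -> Tuple[Matrix, Matrix]:
    """Divide activation channel j by s_j and multiply weight row j by s_j.
    The product x @ w is unchanged, but activation outliers shrink."""
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


def quantize_per_token(x: Matrix, bits: int = 8) -> Tuple[List[List[int]], List[float]]:
    """Symmetric dynamic quantization: each token (row) gets its own scale,
    computed at run time from that row's largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in x:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0.0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


if __name__ == "__main__":
    # Channel 1 carries an activation outlier; channel 0 does not.
    x = [[0.1, 8.0], [-0.2, -6.0]]
    w = [[0.5, -0.25], [0.05, 0.1]]
    s = smooth_scales(x, w)
    x_s, w_s = apply_smoothing(x, w, s)
    q, token_scales = quantize_per_token(x_s)
    print("smoothing factors:", s)
    print("quantized tokens:", q)
    print("per-token scales:", token_scales)
```

Because the per-token scale is computed from each row at inference time, no calibrated activation range has to be fixed in advance, which is the property that static quantization lacks when outliers vary from token to token.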
