FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization

Aotao Wang; Haikuo Shao; Shaobo Ma; Zhongfeng Wang

arXiv:2505.18975 · cs.AR · July 29, 2025

TL;DR

FastMamba is an FPGA-based accelerator that significantly improves the deployment efficiency of Mamba2 state space models by employing hardware-algorithm co-design, accurate quantization, and optimized nonlinear function approximation.

Contribution

This paper introduces FastMamba, a novel FPGA accelerator with hardware-algorithm co-design that enables efficient, accurate quantization and processing of Mamba2 models on resource-constrained devices.

Findings

01. Achieves 68.80× speedup over CPU and 8.90× over GPU on specific tasks.

02. Attains 6× higher energy efficiency than GPU for large models.

03. Successfully implements 8-bit quantization and nonlinear function approximation.
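
As a rough illustration of why power-of-two scales (which the abstract names for the SSM block and convolution layer) are considered hardware-friendly, the sketch below quantizes two small vectors to 8-bit integers with scales of the form 2^-shift, multiplies them element-wise in the integer domain, and rescales the result with a bit shift only. The function names (`pow2_quantize`, `pow2_dequantize`, `rescale`), the per-vector granularity, and the sample values are assumptions made for this example; they are not taken from the paper.

```python
import math

def pow2_quantize(values, bits=8):
    """Quantize to signed integers with a power-of-two scale 2**-shift.

    Returns (codes, shift). Per-vector granularity is an assumption here.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Largest shift such that max_abs * 2**shift still fits in the integer range.
    shift = math.floor(math.log2(qmax / max_abs))
    codes = [max(-qmax, min(qmax, round(v * 2 ** shift))) for v in values]
    return codes, shift

def pow2_dequantize(codes, shift):
    """Map integer codes back to real values."""
    return [c / 2 ** shift for c in codes]

def rescale(acc, from_shift, to_shift):
    """Change the scale of an integer using only a bit shift (no multiplier)."""
    d = from_shift - to_shift
    return acc >> d if d >= 0 else acc << -d

# Hypothetical operands of an element-wise tensor multiplication.
a = [0.5, -0.25, 0.75]
b = [1.5, 2.0, -3.0]

a_codes, a_shift = pow2_quantize(a)   # scale 2**-7
b_codes, b_shift = pow2_quantize(b)   # scale 2**-5

# Integer products carry the combined scale 2**-(a_shift + b_shift).
prod_codes = [p * q for p, q in zip(a_codes, b_codes)]
prod_shift = a_shift + b_shift

# Bring the products back to the scale of b with an arithmetic right shift.
out_codes = [rescale(p, prod_shift, b_shift) for p in prod_codes]
out = pow2_dequantize(out_codes, b_shift)
```

Because every scale is a power of two, moving between scales costs a shift rather than a multiplication, which is the property that makes this style of quantization attractive on FPGA fabric. The sample values are dyadic, so `out` matches the exact products here; general inputs would incur rounding error.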

Abstract

State Space Models (SSMs), such as the recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices faces several problems: severe outliers in the linear layers that complicate quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated FPGA accelerator with hardware-algorithm co-design that improves the deployment efficiency of Mamba2. Specifically, we achieve 8-bit quantization for linear layers by using the Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the…
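
To make the outlier argument concrete, here is a minimal sketch of how a Hadamard rotation can spread one large activation across all channels before 8-bit quantization. It is not the paper's implementation: the helper names, the symmetric per-tensor int8 scheme, and the sample vector are assumptions chosen for illustration, and the summary above does not say exactly where FastMamba applies the transform.

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization; returns (codes, scale)."""
    scale = max(abs(v) for v in x) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Hypothetical activation vector with one severe outlier channel.
x = [0.11, -0.07, 0.05, 0.09, -0.12, 0.03, 40.0, -0.08]
H = hadamard(len(x))

# Baseline: quantize directly. The outlier sets the scale, so the
# small channels all collapse to zero.
err_direct = rms_error(x, dequantize(*quantize_int8(x)))

# Rotate, quantize, dequantize, rotate back. The Sylvester matrix is
# symmetric and orthonormal, so it is its own inverse.
x_rot = matvec(H, x)
rot_codes, rot_scale = quantize_int8(x_rot)
recovered = matvec(H, dequantize(rot_codes, rot_scale))
err_rot = rms_error(x, recovered)
```

In this example the largest magnitude drops from 40 to roughly 40/√8 after rotation, so the quantization step shrinks and `err_rot` comes out below `err_direct`. In a linear layer the rotation can be folded into the weights, since (xH)(HᵀW) = xW for an orthonormal H, which leaves the layer's output unchanged in exact arithmetic.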

Taxonomy

Topics: Embedded Systems Design Techniques · Low-power high-performance VLSI design · Parallel Computing and Optimization Techniques

Methods: Convolution · Linear Layer