FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization
Aotao Wang, Haikuo Shao, Shaobo Ma, Zhongfeng Wang

TL;DR
FastMamba is an FPGA-based accelerator that significantly improves the deployment efficiency of Mamba2 state space models by employing hardware-algorithm co-design, accurate quantization, and optimized nonlinear function approximation.
Contribution
This paper introduces FastMamba, a novel FPGA accelerator with hardware-algorithm co-design that enables efficient, accurate quantization and processing of Mamba2 models on resource-constrained devices.
Findings
Achieves 68.80× speedup over CPU and 8.90× over GPU on specific tasks.
Attains 6× higher energy efficiency than GPU for large models.
Implements 8-bit quantization of linear layers via Hadamard transformation, together with a hardware-friendly approximation of the nonlinear functions.
Abstract
State Space Models (SSMs), such as the recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices poses several problems: severe outliers in the linear layers that hinder quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated FPGA accelerator with hardware-algorithm co-design that improves the deployment efficiency of Mamba2. Specifically, we achieve 8-bit quantization of the linear layers by using a Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly, fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the…
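As a rough illustration of the two quantization ideas named in the abstract, the sketch below shows (1) how rotating a vector with a normalized Hadamard matrix spreads a single outlier across all elements before symmetric 8-bit quantization, and (2) one possible reading of power-of-two quantization, in which the scale is restricted to 2^k so that rescaling reduces to a bit shift. This is not the FastMamba implementation: the function names, the per-tensor symmetric scheme, the rounding rule, and the sample vector are all assumptions made for illustration only.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix of size n (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1.0]]
    while len(h) < n:
        # Build [[H, H], [H, -H]] from the previous H.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization (assumed scheme): returns (codes, scale)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def quantize_pow2(x, bits=8):
    """Quantize with a scale of the form 2**exp (assumed scheme): returns (codes, exp)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in x)
    # Smallest power-of-two scale that keeps every value inside [-qmax, qmax].
    exp = math.ceil(math.log2(max_abs / qmax)) if max_abs > 0 else 0
    scale = 2.0 ** exp
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, exp

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Hypothetical activation vector with one severe outlier (8.0).
x = [0.1, -0.2, 0.15, 8.0, -0.05, 0.12, -0.1, 0.07]
H = hadamard(len(x))

# Direct quantization: the outlier dictates the scale, so small values lose precision.
codes, scale = quantize_int8(x)
mse_direct = mse(x, dequantize(codes, scale))

# Rotate, quantize, dequantize, rotate back. The normalized Sylvester matrix is
# symmetric and orthonormal, so it is its own inverse.
y = matvec(H, x)
codes_r, scale_r = quantize_int8(y)
x_rec = matvec(H, dequantize(codes_r, scale_r))
mse_rotated = mse(x, x_rec)

print(f"direct MSE:  {mse_direct:.6f}")
print(f"rotated MSE: {mse_rotated:.6f}")
print("power-of-two codes and exponent:", quantize_pow2([0.5, -1.0, 0.25]))
```

In a linear layer the same rotation can in principle be folded into the weights, since (xH)(HᵀW) = xW for an orthonormal H, so the layer output is unchanged while both operands are quantized in the rotated domain. Whether FastMamba applies the transformation in exactly this way is not stated in the truncated abstract above.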
Taxonomy
Topics: Embedded Systems Design Techniques · Low-power high-performance VLSI design · Parallel Computing and Optimization Techniques
Methods: Convolution · Linear Layer
