TL;DR
LightMamba co-designs a post-training quantization algorithm and an FPGA accelerator architecture for Mamba state space models, reporting 4.65x to 6.06x higher energy efficiency and 1.43x higher speed than a GPU baseline.
Contribution
It presents an FPGA-friendly post-training quantization method, combining rotation-assisted quantization with power-of-two SSM quantization, and an FPGA accelerator that partially unrolls the Mamba computation to balance efficiency against hardware cost.
Findings
Achieves 4.65x to 6.06x higher energy efficiency than the GPU baseline.
Reaches 93 tokens/sec on FPGA, 1.43x faster than the GPU baseline.
Reduces the majority of computation to 4-bit using rotation-assisted quantization and power-of-two SSM quantization.
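To make the power-of-two idea concrete, here is a minimal sketch of quantizing a vector to signed 4-bit integers with a scale restricted to a power of two. It is an illustration under stated assumptions, not the paper's algorithm: the function names `pow2_quantize` and `pow2_dequantize`, the per-vector max-abs calibration, and the rounding rule are all hypothetical choices. The point it shows is why such a scale suits an FPGA: rescaling by 2^e is a bit shift rather than a multiplication.

```python
import math

def pow2_quantize(values, bits=4):
    """Quantize floats to signed `bits`-bit integers with a power-of-two scale.

    Assumption: a single scale per vector, calibrated from the max absolute value.
    Returns (integers, exponent) where the scale is 2**exponent.
    """
    qmax = 2 ** (bits - 1) - 1       # 7 for 4-bit
    qmin = -(2 ** (bits - 1))        # -8 for 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest exponent e such that max_abs / 2**e still fits within qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    q = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return q, exponent

def pow2_dequantize(q, exponent):
    """Recover approximate floats; in hardware this multiply is a shift by `exponent`."""
    return [x * 2.0 ** exponent for x in q]

values = [0.5, -1.75, 3.0, 0.1]
q, exponent = pow2_quantize(values)
approx = pow2_dequantize(q, exponent)
```

Because the exponent is rounded up, the largest value never clips, and each element's reconstruction error is at most half the scale.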
Abstract
State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through…
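The abstract attributes part of the difficulty to scattered activation outliers and names rotation-assisted quantization as the remedy. The sketch below illustrates the general idea only. It assumes an orthonormal Hadamard rotation, which the text above does not specify, and the function name `hadamard_rotate` and the sample activations are hypothetical. A rotation spreads one large outlier across all channels, so the maximum magnitude that a low-bit quantizer must cover shrinks, and because the rotation is orthogonal it can be undone exactly.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform of a list.

    Assumption: len(x) is a power of two. The normalized Sylvester-ordered
    transform is symmetric and orthogonal, so applying it twice returns the input.
    """
    n = len(x)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs of elements that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

# Hypothetical activations with a single large outlier in channel 0.
activations = [16.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1]
rotated = hadamard_rotate(activations)
restored = hadamard_rotate(rotated)

max_before = max(abs(v) for v in activations)
max_after = max(abs(v) for v in rotated)
```

In this example the outlier's energy is shared across all eight channels, so `max_after` is well below `max_before` and a 4-bit grid calibrated on the rotated vector has a finer step size.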
