FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference
Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu

TL;DR
This paper presents an FPGA-based co-design framework that combines N:M structured pruning and quantization to significantly improve the efficiency of large language model inference across multiple hardware platforms.
Contribution
It introduces a novel hardware-software co-design method that integrates structured sparsity and quantization for FPGA accelerators, enhancing LLM inference efficiency.
Findings
Achieves up to 4x reduction in weight storage.
Realizes 1.71x speedup in matrix multiplication.
Reduces end-to-end latency by 1.29x.
Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes their deployment in resource-constrained environments. To address this challenge, this work introduces an automation framework that leverages weight pruning and low-bit quantization, and presents a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform. In particular, we implement a unified pipeline that applies N:M structured pruning and 4-bit integer quantization to reduce the memory footprint, followed by optimized dequantization and matrix multiplication to enhance LLM inference on several hardware platforms, including CPUs, NVIDIA GPUs with Dense and 2:4 Sparse Tensor Cores, and a…
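The two compression steps named in the abstract, N:M structured pruning and 4-bit integer quantization, can be illustrated with a minimal pure-Python sketch. The abstract does not state the pruning criterion or the quantization scheme, so this sketch assumes magnitude-based selection within each group of M weights and symmetric per-row scaling. The function names `prune_n_m`, `quantize_int4`, and `dequantize` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def prune_n_m(weights: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in every group of m; zero the rest.

    Assumption: magnitude-based selection (the paper's criterion may differ).
    """
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization to signed 4-bit integers in [-8, 7].

    Assumption: one scale per row, chosen so the largest |w| maps to 7.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map 4-bit integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


# One weight row of length 8, i.e. two groups of four under 2:4 sparsity.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_n_m(row, n=2, m=4)
q_row, scale = quantize_int4(sparse_row)
approx_row = dequantize(q_row, scale)
```

After pruning, each group of four holds at most two nonzero weights, which is the pattern that 2:4 Sparse Tensor Cores expect. Dequantization recovers each kept weight only approximately, to within about half a quantization step.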
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
