FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference

Fen-Yu Hsieh; Yun-Chang Teng; Ding-Yong Hong; Jan-Jan Wu

arXiv:2512.24713 · cs.LG · January 21, 2026

TL;DR

This paper presents an FPGA-based co-design framework that combines N:M structured pruning with 4-bit integer quantization to reduce the memory footprint and speed up large language model inference on several hardware platforms.

Contribution

The paper introduces an automation framework and a hardware-software co-design method that generates FPGA accelerators combining N:M structured sparsity with low-bit quantization for LLM inference.

Findings

01. Achieves up to a 4x reduction in weight storage.

02. Realizes a 1.71x speedup in matrix multiplication.

03. Reduces end-to-end latency by a factor of 1.29.

Abstract

Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes their deployment in resource-constrained environments. To address this challenge, this work introduces an automation framework that leverages weight pruning and low-bit quantization, and presents a hardware-software co-design method that generates accelerators on the Field-Programmable Gate Array (FPGA) platform. In particular, we implement a unified pipeline that applies N:M structured pruning and 4-bit integer quantization to reduce the memory footprint, followed by optimized dequantization and matrix multiplication to enhance LLM inference on several hardware platforms, including CPUs, NVIDIA GPUs with Dense and 2:4 Sparse Tensor Cores, and a…
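The pipeline the abstract describes (N:M pruning, then 4-bit integer quantization, then dequantization before the matrix multiply) can be illustrated with a small sketch. This is not the authors' implementation: the function names, magnitude-based selection of kept weights, and symmetric per-row scaling are all assumptions chosen for illustration, and the code is pure Python rather than anything resembling an FPGA kernel.

```python
from typing import List, Tuple


def prune_n_m(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in each group of m; zero the rest.

    Magnitude-based selection is an assumption; the paper may use another criterion.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest magnitudes within this group of m.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def quantize_int4(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-row quantization to signed 4-bit integers in [-8, 7].

    Per-row symmetric scaling is an assumption made for this sketch.
    """
    max_abs = max((abs(v) for v in row), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in row]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product, standing in for one row of a matrix multiply."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights and activations, used only to exercise the functions.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

sparse = prune_n_m(weights, n=2, m=4)   # 2:4 pattern: two non-zeros per four
q, scale = quantize_int4(sparse)        # 4-bit integer codes plus one scale
approx = dequantize(q, scale)           # dequantize before the multiply
result = dot(approx, activations)
```

In a 2:4 pattern, only half of the weights in each group of four need to be stored, together with small per-group index metadata, and each stored weight occupies 4 bits. A hardware design would also skip the zeroed positions during the multiply, which this sketch does not attempt.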

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy