HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs

Chang Sun; Zhiqiang Que; Thea K. Årrestad; Vladimir Loncar; Jennifer Ngadiuba; Wayne Luk; Maria Spiropulu

arXiv:2405.00645·cs.LG·December 22, 2025

TL;DR

HGQ is a quantization-aware training framework that optimizes the bit-width of each neural network parameter independently through gradient descent, targeting sub-microsecond inference latency on FPGAs.

Contribution

HGQ introduces a gradient descent-based method for per-parameter bit-width optimization, supporting heterogeneous precision and significantly reducing resource use and latency.

Findings

01. Achieves orders of magnitude reduction in resource consumption

02. Maintains accuracy across multiple benchmark tasks

03. Enables deployment of complex models in latency-critical applications

Abstract

Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic. In our experiments, HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining the accuracy on several benchmark tasks. These improvements enable the deployment of complex models previously infeasible due to resource or latency constraints. HGQ is open-source and is used for developing…
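The idea of giving every parameter its own precision can be illustrated with a minimal sketch. It is not the HGQ implementation or its API: the function names `quantize_fixed_point` and `total_fractional_bits` are hypothetical, the bit-widths are chosen by hand, and the gradient-based optimization of those bit-widths described in the abstract is omitted. The sketch only shows what heterogeneous fixed-point rounding looks like, along with a crude bit-count proxy for resource use.

```python
def quantize_fixed_point(value, frac_bits):
    """Round one value to a fixed-point grid with `frac_bits` fractional bits.

    Hypothetical helper for illustration only; not part of HGQ.
    """
    scale = 2 ** frac_bits          # grid step is 1 / scale
    return round(value * scale) / scale


def total_fractional_bits(frac_bits_per_param):
    """Crude resource proxy (an assumption): sum of fractional bits used."""
    return sum(frac_bits_per_param)


# Each weight gets its own bit-width (hand-picked here, learned in HGQ).
weights = [0.30, -0.72, 0.05, 1.40]
frac_bits = [1, 3, 0, 2]

quantized = [quantize_fixed_point(w, b) for w, b in zip(weights, frac_bits)]
bit_budget = total_fractional_bits(frac_bits)

print(quantized)    # [0.5, -0.75, 0.0, 1.5]
print(bit_budget)   # 6
```

A weight assigned zero fractional bits and a small magnitude rounds to 0.0, which is how per-parameter precision can also act as pruning on hardware that supports arbitrary bit-widths.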
