HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference

Dinesh Gopalan; Ratul Ali

arXiv:2602.06069·cs.DC·February 9, 2026

TL;DR

This paper presents HQP, an integrated sensitivity-aware hybrid quantization and pruning framework that significantly accelerates edge AI inference while maintaining strict accuracy guarantees.

Contribution

The novel HQP framework combines sensitivity-aware pruning with post-training quantization, ensuring robust, hardware-optimized model compression for ultra-low-latency edge inference.

Findings

01. Achieves up to 3.12x inference speedup
02. Reduces model size by 55%
03. Maintains accuracy drop below 1.5%

Abstract

The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from a highly efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. This pruning is strictly conditional: it enforces adherence to a maximum permissible accuracy drop (Δ_ax) before the model proceeds to 8-bit post-training quantization. This rigorous coordination is critical, as it ensures the resultant sparse…
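
The pipeline the abstract describes (score filters by sensitivity, prune iteratively while the accuracy drop stays within a budget, then quantize to 8 bits) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. It assumes a diagonal empirical Fisher approximation (mean squared per-sample gradient), which is one common cheap approximation and may differ from the one used in the paper. All function names, the `evaluate` callback, and the symmetric per-tensor quantizer are hypothetical choices made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def fisher_filter_scores(
    per_sample_grads: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Per-filter sensitivity from a diagonal empirical Fisher estimate.

    per_sample_grads[s][f][k] is the loss gradient for weight k of filter f
    on sample s. The assumed approximation averages squared gradients over
    samples and sums them within each filter.
    """
    n_samples = len(per_sample_grads)
    scores = [0.0] * len(per_sample_grads[0])
    for sample in per_sample_grads:
        for f, filter_grads in enumerate(sample):
            scores[f] += sum(g * g for g in filter_grads) / n_samples
    return scores


def prune_with_accuracy_gate(
    scores: Sequence[float],
    evaluate: Callable[[List[bool]], float],  # hypothetical: mask -> accuracy
    baseline_acc: float,
    max_drop: float,
    step: int = 1,
) -> List[bool]:
    """Remove the least-sensitive filters in rounds of `step`.

    A round is accepted only if the accuracy drop stays within max_drop;
    the first rejected round stops the loop and the last accepted mask
    is returned (True = keep the filter).
    """
    keep = [True] * len(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    for start in range(0, len(order), step):
        trial = list(keep)
        for i in order[start:start + step]:
            trial[i] = False
        if not any(trial) or baseline_acc - evaluate(trial) > max_drop:
            break
        keep = trial
    return keep


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor 8-bit quantization (an assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale
```

In use, `evaluate` would run the masked model on a validation set, and `max_drop` would be set to the accuracy budget (for example 0.015 for the 1.5% figure listed under Findings). Quantization is applied only to the filters that survive the gated pruning loop, which mirrors the ordering the abstract describes.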

Taxonomy

Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Domain Adaptation and Few-Shot Learning