HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration
Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

TL;DR
HALO is a hardware-aware quantization framework for LLMs that optimizes weights based on circuit timing and energy profiles, significantly boosting inference speed and reducing energy use on accelerators.
Contribution
HALO introduces a hardware-aware post-training quantization method that incorporates circuit-level timing and energy considerations, enabling more efficient LLM deployment.
Findings
Achieves a 270% improvement in inference speed on TPUs and GPUs.
Reduces energy consumption by 51% compared to baseline methods.
Maintains high accuracy, with minimal degradation from quantization.
Abstract
Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators. To address these limitations, we propose HALO, a versatile framework for Hardware-Aware Post-Training Quantization (PTQ). Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption, into its quantization approach. HALO strategically selects weights with low critical-path delays, enabling higher…
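The idea of restricting quantized weights to values with low critical-path delay can be illustrated with a small sketch. This is not the HALO algorithm from the paper: the delay model here (the number of set bits in the weight's magnitude) is a hypothetical stand-in for a real per-weight timing profile of a MAC unit, and the function names `proxy_delay`, `low_delay_codebook`, and `quantize` are invented for illustration.

```python
# Illustrative sketch only, not the authors' method.
# ASSUMPTION: a weight's critical-path delay is approximated by the number
# of set bits in its magnitude. A real flow would use measured or simulated
# MAC timing data instead of this proxy.

def proxy_delay(w):
    """Hypothetical delay score for an integer weight: popcount of |w|."""
    return bin(abs(w)).count("1")

def low_delay_codebook(max_delay, lo=-128, hi=127):
    """All int8 weight values whose proxy delay is at most max_delay."""
    return [w for w in range(lo, hi + 1) if proxy_delay(w) <= max_delay]

def quantize(values, scale, codebook):
    """Map each real value to the nearest allowed integer weight.

    Ties are broken toward the smaller magnitude.
    """
    out = []
    for v in values:
        target = v / scale
        out.append(min(codebook, key=lambda w: (abs(w - target), abs(w))))
    return out

weights = [0.07, -0.31, 0.55, 0.9]
scale = 1 / 127                      # symmetric int8 scale for values in [-1, 1]
codebook = low_delay_codebook(max_delay=2)
q = quantize(weights, scale, codebook)
dequantized = [w * scale for w in q]
```

Tightening `max_delay` shrinks the set of allowed weight values, which in this toy model trades quantization error for a shorter worst-case delay. A hardware-aware method has to balance the same kind of trade-off against accuracy.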
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques
