HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

Rohan Juneja; Shivam Aggarwal; Safeen Huda; Tulika Mitra; Li-Shiuan Peh

arXiv:2502.19662 · cs.AR · November 18, 2025

TL;DR

HALO is a hardware-aware quantization framework for LLMs that optimizes weights based on circuit timing and energy profiles, significantly boosting inference speed and reducing energy use on accelerators.

Contribution

HALO introduces a hardware-aware post-training quantization method that incorporates circuit-level timing and energy considerations, enabling more efficient LLM deployment.

Findings

01. Achieves 270% inference speed improvement on TPUs and GPUs.

02. Reduces energy consumption by 51% compared to baseline methods.

03. Maintains high accuracy, with minimal degradation from quantization.

Abstract

Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators. To address these limitations, we propose HALO, a versatile framework for Hardware-Aware Post-Training Quantization (PTQ). Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption, into its quantization approach. HALO strategically selects weights with low critical-path-delays enabling higher…
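
The idea of favoring weights with low critical-path delay can be illustrated with a minimal sketch. It assumes a purely hypothetical delay proxy (the number of set bits in a weight's magnitude) in place of a real circuit-level timing profile of a MAC unit. The names `toy_delay`, `allowed_values`, and `quantize` are illustrative and are not taken from the paper, and this is not HALO's actual algorithm.

```python
def toy_delay(q: int) -> int:
    """Hypothetical delay proxy: set bits in |q|. NOT a real MAC timing model."""
    return bin(abs(q)).count("1")


def allowed_values(bits: int, max_delay: int) -> list:
    """Symmetric signed codes for `bits` bits whose proxy delay fits the budget."""
    hi = 2 ** (bits - 1) - 1
    return [q for q in range(-hi, hi + 1) if toy_delay(q) <= max_delay]


def quantize(weights: list, bits: int = 8, max_delay: int = 2):
    """Uniform symmetric quantization, then snap each code to the nearest
    code that the (assumed) delay proxy classifies as low-delay."""
    hi = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / hi if max_abs > 0 else 1.0
    candidates = allowed_values(bits, max_delay)
    codes = []
    for w in weights:
        q = max(-hi, min(hi, round(w / scale)))
        # Nearest allowed code; ties go to the smaller magnitude.
        codes.append(min(candidates, key=lambda c: (abs(c - q), abs(c))))
    return codes, scale


codes, scale = quantize([0.0, 1.0, -1.0], bits=8, max_delay=2)
```

With a tight delay budget the largest weights are pulled far from their ideal codes, which shows the accuracy versus timing trade-off that a hardware-aware quantizer has to balance. Relaxing `max_delay` to the full bit-width recovers ordinary uniform quantization.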

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques