HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

Guoan Wang; Feiyu Wang; Zongwei Lv; Yikun Zong; Tong Yang

arXiv:2601.20745·cs.LG·January 29, 2026

TL;DR

Hestia is a Hessian-guided differentiable quantization-aware training framework that improves extremely low-bit LLM performance by preserving gradient flow early in training and discretizing weights in a sensitivity-aware manner.

Contribution

The paper proposes a Hessian-guided, temperature-controlled softmax relaxation of the quantizer for low-bit quantization-aware training of LLMs, which keeps gradients flowing early in training and improves the accuracy of the quantized models.

Findings

01. Outperforms existing ternary QAT baselines on Llama-3.2.

02. Achieves average zero-shot improvements of 5.39% and 4.34% for the Llama-3.2 1B and 3B models, respectively.

03. Establishes a robust training path for 1.58-bit LLMs.

Abstract

As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of training. This prematurely discretizes the optimization landscape and induces a persistent gradient mismatch between latent weights and quantized weights, which hinders effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs. Hestia replaces the rigid step function with a temperature-controlled softmax relaxation that maintains gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature…
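The idea of a temperature-controlled softmax relaxation can be illustrated with a small NumPy sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes a ternary grid {-1, 0, +1}, weights that are already scaled to that range, and logits given by the negative squared distance to each level. The names `soft_ternary`, `hard_ternary`, and `LEVELS` are hypothetical.

```python
import numpy as np

# Ternary grid; three levels carry log2(3) ~ 1.58 bits per weight.
LEVELS = np.array([-1.0, 0.0, 1.0])


def soft_ternary(w, temperature):
    """Softmax-weighted average of the ternary levels.

    Illustrative assumption: logits are the negative squared distance
    from each weight to each level, divided by the temperature.
    """
    w = np.asarray(w, dtype=np.float64)
    logits = -((w[..., None] - LEVELS) ** 2) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ LEVELS  # expected level under the soft assignment


def hard_ternary(w):
    """Nearest-level rounding, the step function that the relaxation approaches."""
    w = np.asarray(w, dtype=np.float64)
    nearest = np.argmin(np.abs(w[..., None] - LEVELS), axis=-1)
    return LEVELS[nearest]


weights = np.array([-0.9, -0.2, 0.3, 0.8])
for t in (1.0, 0.1, 0.001):
    # Lower temperature pushes the soft output toward the hard levels.
    print(t, soft_ternary(weights, t))
print("hard", hard_ternary(weights))
```

At high temperature the output is a smooth function of the weight, so gradients with respect to the latent weights are informative. As the temperature drops, the output approaches nearest-level rounding, which is one way to read "progressively hardening quantization". How the temperature is scheduled, and how the Hessian trace enters, is not stated in the visible part of the abstract and is left out of the sketch.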

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Domain Adaptation and Few-Shot Learning