BAQ: Efficient Bit Allocation Quantization for Large Language Models

Chao Zhang, Li Wang, Samson Lasaulce, and Merouane Debbah

arXiv:2506.05664 · cs.LG · June 9, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces BAQ, a novel bit allocation quantization method for large language models that optimally assigns bitwidths based on sensitivity, significantly reducing perplexity compared to existing methods.

Contribution

The paper presents a convex optimization-based framework for adaptive bitwidth allocation in quantization, with a closed-form solution and an efficient algorithm called BAQ.

Findings

01. BAQ outperforms GPTQ with up to 56× lower perplexity.

02. The method effectively balances loss minimization and computational complexity.

03. Experimental validation on models from 125M to 30B parameters demonstrates broad applicability.

Abstract

Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to…
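The equal-loss structure mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss expression, so the code below assumes the classical bit-allocation model in which the layer loss is the sum of `s_i * 2**(-2 * b_i)` over weight groups, with `s_i` a positive sensitivity and `b_i` a real-valued bitwidth, under a fixed average-bit budget. The function names and the sensitivity values are hypothetical, and integer rounding and non-negativity of bitwidths are ignored. It is not the authors' BAQ algorithm.

```python
import math

def allocate_bits(sensitivities, avg_bits):
    """Closed-form real-valued allocation for the ASSUMED loss sum_i s_i * 2**(-2*b_i).

    Setting the Lagrangian gradient to zero gives s_i * 2**(-2*b_i) = const,
    hence b_i = avg_bits + 0.5 * (log2(s_i) - mean_j log2(s_j)).
    """
    logs = [math.log2(s) for s in sensitivities]
    mean_log = sum(logs) / len(logs)
    return [avg_bits + 0.5 * (lg - mean_log) for lg in logs]

def per_item_loss(sensitivities, bits):
    """Loss contribution of each item under the same assumed model."""
    return [s * 2.0 ** (-2.0 * b) for s, b in zip(sensitivities, bits)]

# Hypothetical sensitivities for four weight groups (not taken from the paper).
sens = [16.0, 4.0, 1.0, 0.25]
bits = allocate_bits(sens, avg_bits=3.0)       # [4.5, 3.5, 2.5, 1.5]
losses = per_item_loss(sens, bits)             # every entry equals 0.03125
uniform_loss = sum(per_item_loss(sens, [3.0] * len(sens)))

print(bits, sum(losses), uniform_loss)
```

In this toy case every group ends up with the same loss contribution, and the total (0.125) is lower than with a uniform 3-bit assignment (about 0.332). More sensitive groups receive more bits, and the average stays at 3.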

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

+ The new formulation could be of interest.

Weaknesses

- Fine-grain, ungrouped mixed precision quantization is not hardware-acceleration friendly, limiting the practical value, which is not convincingly established empirically here either.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper is well written, well organized, and easy to follow
- The theoretical formulation to derive the optimal bitwidth allocation as a function of layer loss and sensitivity is elegant and well grounded
- The investigation encompasses large scale LLMs with 7-30B parameters, highly relevant to practical workloads nowadays
- An interesting link is drawn between dispersion of the sensitivity coefficients and BAQ effectiveness
- The authors share their code and analytical derivations

Weaknesses

- A key limitation is that, contrary to what is stated, the technique is very _unfriendly_ to hardware as it requires assignment of independent bitwidths to each matrix column. This is not supported in today's accelerators. Running kernels where each column uses a different bitwidth would require custom per-column packing/unpacking logic, adding latency and potentially memory overhead
- In the 3-bit regime, perplexity/accuracy results are comparable to GPTQ. At 2 bits, although BAQ improves sig

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The method provides a theoretically grounded formulation of mixed-precision quantization through the equal-loss principle and derives an analytical optimality condition with a closed-form solution rather than relying on heuristics, making it a novel and well-established approach to bit allocation in LLM quantization.

Weaknesses

- Although bits are allocated at the column level, it is unclear how this design leads to real hardware acceleration; the approach appears focused mainly on memory footprint reduction rather than compute efficiency. Hardware constraints and deployment aspects are not discussed, leaving practical feasibility uncertain.
- The comparative analysis is limited and lacks sufficient breadth. In particular, the paper omits comparisons with recent mixed-precision quantization methods that optimize under

Reviewer 04 · Rating 2 · Confidence 5

Strengths

The paper is well-written and easy to follow. The bit allocation formulation seems to be correct, and the corresponding solution (mathematical derivation) is correct.

Weaknesses

1. The major concern is that the reported performance is not good. From the results in Table 1, I could observe the improvement when GPTQ is combined with BAQ. However, the final perplexity scores are not good even for large models (e.g., Llama2-7B, Llama2-13B), when compared to the recent quantization methods such as aespa [1], AutoRound [2], BoA [3], and GPTAQ [4]. Please compare the performance with these recent baselines, and also integrate BAQ into these methods to show the validity of the

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Advanced Neural Network Applications