Boosting Entropy with Bell Box Quantization

Ningfeng Yang; Tor M. Aamodt

arXiv:2603.01599·cs.LG·March 3, 2026

TL;DR

BBQ is a quantization method for Quantization-Aware Pre-Training that is both information-theoretically optimal (ITO) and compute-efficient: it quantizes in its input domain and returns its output in a compute-efficient domain, improving model quality without giving up the efficiency needed on edge devices.

Contribution

BBQ is the first ITO quantization method that is also compute-efficient, enabling better model performance without sacrificing efficiency.

Findings

01

Outperforms prior SOTA QAPT methods in perplexity reduction.

02

Achieves up to 18-point perplexity improvement for 1-bit models.

03

Demonstrates effectiveness across various bit-width models.

Abstract

Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g. integers) that are not information theoretically optimal (ITO). On the other hand, existing ITO data types (e.g. Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain, and returns its output in a compute-efficient domain where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ…
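The entropy argument in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes standard-normal weights, uses a clipped uniform grid (clip at ±3, a hypothetical choice) as a stand-in for an integer data type, and uses a Gaussian-CDF quantile quantizer as a stand-in for an ITO data type. All function names are illustrative.

```python
import math
import numpy as np

def entropy_bits(codes, num_levels):
    """Empirical Shannon entropy (in bits) of integer codes."""
    counts = np.bincount(codes, minlength=num_levels)
    p = counts[counts > 0] / codes.size
    return float(-(p * np.log2(p)).sum())

def uniform_int_quantize(x, bits, clip=3.0):
    """Assumed integer-style baseline: equal-width bins on [-clip, clip)."""
    levels = 2 ** bits
    step = 2 * clip / levels
    clipped = np.clip(x, -clip, clip - 1e-9)
    return np.floor((clipped + clip) / step).astype(np.int64)

def gaussian_cdf_quantize(x, bits):
    """Assumed quantile-style quantizer: equal-probability bins under N(0, 1)."""
    levels = 2 ** bits
    # Map each value through the standard normal CDF, giving u in (0, 1).
    u = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    # Uniformly quantize u; the result is a plain integer code.
    return np.minimum((u * levels).astype(np.int64), levels - 1)

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
bits = 3

h_uniform = entropy_bits(uniform_int_quantize(w, bits), 2 ** bits)
h_cdf = entropy_bits(gaussian_cdf_quantize(w, bits), 2 ** bits)
print(f"uniform grid entropy: {h_uniform:.3f} bits")
print(f"CDF-based entropy:    {h_cdf:.3f} bits")
```

Under these assumptions the CDF-based codes use all `2**bits` levels about equally often, so their empirical entropy approaches `bits`, while the clipped uniform grid concentrates mass in the central bins and scores lower. Both quantizers emit ordinary integer codes, which is the sense in which the output can live in a compute-efficient domain.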

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper proposes to transform the input for improved performance. This idea is novel and does lead to better-performing LLMs.

Weaknesses

1. It is not known whether transforming only the input would lead deeper layers to also preserve the "utilize all of the quantized levels equally often" property. The transformer layer norm would bias them toward doing so, however, so this is not a major problem.

Reviewer 02 · Rating 4 · Confidence 5

Strengths

The authors correctly note that existing QAPT methods are limited in their representation capacity, as they implicitly constrain the entropy of the quantised weights. They ground this hypothesis clearly and use it to motivate the development of the first information-theoretically optimal (ITO) quantizer for QAPT. The resulting method, BBQ, is conceptually straightforward yet effective, consistently improving over strong baselines such as QuEST and LSQ across multiple bit widths. A key strength

Weaknesses

In general, the premise of the paper is sound but requires significant work in presentation, ablations and experimental results to qualify for the conference:
* The experimentation section is lacking. I would like to see zero-shot results and one more family of models, rather than just increasing the size of the same architecture. Please include a comparison to other SOTA methods, such as ParetoQ.
* From a methodological standpoint, the contribution feels incremental relative to QuEST: most o

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The introduction of ITO-based quantization combined with hardware-compatible output domains is a well-motivated and original contribution to quantized training research.
* The step-by-step quantization process (Hadamard Transform → Gaussian CDF → Uniform Quantization → Scaling) is well explained and systematically justified.
* Across multiple model sizes and bit-widths, BBQ consistently achieves lower perplexity and higher entropy compared to LSQ and QuEST, validating the proposed method.
* Us
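The four steps this review names (Hadamard Transform → Gaussian CDF → Uniform Quantization → Scaling) can be sketched roughly as follows. This is a hedged illustration built only from that one-line description, not the paper's algorithm: the standardization step, the centering of the codes, the `scale` parameter, and the name `bbq_like_quantize` are all assumptions made for the example.

```python
import math
import numpy as np

def hadamard_matrix(n):
    """Orthonormal Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / math.sqrt(n)

def bbq_like_quantize(w, bits, scale=1.0):
    """Hypothetical pipeline: rotate, Gaussian CDF, uniform quantize, scale."""
    levels = 2 ** bits
    # Step 1: Hadamard transform along the last axis (mixes entries so the
    # rotated values look closer to Gaussian).
    rotated = w @ hadamard_matrix(w.shape[-1])
    # Assumption: standardize so the standard normal CDF is a reasonable fit.
    z = rotated / (rotated.std() + 1e-12)
    # Step 2: Gaussian CDF maps values into (0, 1), roughly uniformly.
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    # Step 3: uniform quantization of u into integer codes 0 .. levels-1.
    codes = np.minimum(np.floor(u * levels), levels - 1).astype(np.int64)
    # Step 4 (assumed form): center the codes and apply a scale factor.
    values = scale * (codes - (levels - 1) / 2.0)
    return codes, values

rng = np.random.default_rng(0)
w = rng.laplace(size=(256, 64))  # heavy-tailed toy weights
codes, values = bbq_like_quantize(w, bits=2)
print(np.bincount(codes.ravel(), minlength=4))
```

In this toy setup the printed histogram should be close to flat, meaning each of the four levels is used about equally often even though the input weights are Laplace-distributed rather than Gaussian.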

Weaknesses

- Section 2.3, "learning is domain agnostic": the two examples given to support this rest on a very simplistic assumption. For example, a rotated, cropped, or color-jittered image is still an image in the same domain, and an image transformed to the frequency domain is still an image, only encoded in a different format. In both cases the domain is still image processing. Autoencoders are a bit closer, but they can't be used to train cross-domain models. The authors are encouraged to add more concrete examples and possibly quantification of

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Advanced Memory and Neural Computing