Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations
Patrick Blumenberg, Thomas Graave, Tim Fingscheidt

TL;DR
This paper introduces BOF4, a novel 4-bit block-wise quantization method for LLMs that reduces quantization errors, improves performance, and includes variations like OPQ for handling outliers, enhancing memory efficiency during fine-tuning and inference.
Contribution
The paper proposes a new optimization-based 4-bit block-wise quantizer (BOF4), a normalization modification (BOF4-S), and a mixed-precision strategy (OPQ), advancing memory-efficient LLM quantization techniques.
Findings
BOF4 reduces quantization error compared to baseline methods.
BOF4-S further decreases quantization error and preserves language model performance.
OPQ achieves the best (lowest) perplexity among the 4-bit quantization methods compared.
Abstract
Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the…
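As a rough illustration of the block-wise scheme the abstract describes, the sketch below normalizes each block by its absolute maximum (or, for the BOF4-S idea, by the signed value of its largest-magnitude weight) and maps every normalized weight to the nearest of 16 reconstruction levels. This is a minimal sketch, not the authors' implementation: the function names, the block size of 64, and the uniform placeholder codebook are assumptions made for illustration. The actual BOF4 levels are produced by the paper's optimization and are not reproduced here.

```python
import numpy as np


def quantize_blockwise(w, codebook, block_size=64, signed=False):
    """Quantize a 1-D weight vector block by block (hypothetical helper).

    signed=False: scale is the absolute block maximum (absmax normalization).
    signed=True:  scale is the signed value of the largest-magnitude weight,
                  so that weight always normalizes to exactly +1.
    """
    blocks = np.asarray(w, dtype=np.float32).reshape(-1, block_size)
    rows = np.arange(blocks.shape[0])
    # Value (with sign) of the largest-magnitude weight in each block.
    max_vals = blocks[rows, np.argmax(np.abs(blocks), axis=1)]
    scales = max_vals if signed else np.abs(max_vals)
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    normalized = blocks / scales[:, None]
    # Nearest reconstruction level for every normalized weight.
    dist = np.abs(normalized[:, :, None] - codebook[None, None, :])
    codes = np.argmin(dist, axis=2).astype(np.uint8)
    return codes, scales


def dequantize_blockwise(codes, scales, codebook):
    """Look up each code and rescale by the per-block scale."""
    return (codebook[codes] * scales[:, None]).reshape(-1)


# Placeholder codebook: 16 uniformly spaced levels in [-1, 1].
# This is NOT the BOF4 codebook; it only stands in for one.
placeholder_codebook = np.linspace(-1.0, 1.0, 16)
```

With `signed=True`, the normalized block maximum is always +1, so a codebook designed for that variant does not have to spend a level on -1. That is the extra degree of freedom the reviews refer to.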
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method provides an original perspective on block-wise training, addressing the often-overlooked issue of gradient inconsistency across blocks. 2. The formulation is mathematically rigorous, with theoretical justification for the proposed scheduling strategy. 3. The experiments span several model sizes and datasets, showing consistent improvement over baselines such as layer-wise and progressive tuning. 4. The approach maintains efficiency advantages (reduced memory and trainin
1. While the approach performs well on benchmark datasets, it would be useful to see how well it generalizes to non-language tasks (e.g., multimodal or code models). 2. The method involves scheduling parameters whose influence is only briefly discussed; more detailed robustness analysis would strengthen the contribution. 3. Although the method is efficient, the paper could provide clearer quantification of the additional cost introduced by the new scheduling mechanism relative to vanilla block
- **Theoretical Soundness and Novelty:** The paper's core insight, optimizing the end-to-end quantization error of the *original* weights rather than that of the *normalized* weights, is profound and well-articulated. The derivation of new centroid update rules for Lloyd's algorithm (applicable to both MSE and MAE) constitutes a solid theoretical contribution. Clear and comprehensive mathematical proofs further solidify the theoretical foundation.
- **Holistic Methodological Framework:** The paper
- **Assumption of Gaussian Weight Distribution:** The optimization of the BOF4 codebook relies on the assumption that network weights are Gaussian-distributed. Although Appendix C provides justification that most blocks are indeed Gaussian, especially after OPQ, the performance on models or layers with significantly non-Gaussian weight distributions remains less explored. This could limit the generalizability to certain architectures.
- **Overhead of OPQ:** Although OPQ is shown to have minimal
1. This paper provides a thorough and clear explanation of the method, making it easy for readers to understand. The experimental results are comprehensive, which strongly supports the effectiveness of the proposed method. 2. The method addresses several practical issues in current quantization algorithms, such as the wastage of reconstruction levels and the impact of outliers on quantization accuracy. By cleverly saving reconstruction levels, designing an optimized codebook algorithm, and eli
Although selecting one of the two endpoints as the reconstruction level for the maximum absolute weight provides an additional degree of freedom for the codebooks, it also requires an extra bit to store the sign of this maximum value. Is this overhead justified? Especially since, when using Llama-3.2-3B as the base model, there is no performance improvement on most tasks (Table 2).
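To put the reviewer's question in numbers, here is a rough storage accounting under assumed symbols that this page does not define: block size $B$, and a per-block scale stored in $b_s$ bits. It takes the reviewer's premise of one extra sign bit per block at face value. Whether a separate bit is actually needed depends on how the scale is stored, which is not stated here.

$$
\text{bits per weight} = 4 + \frac{b_s}{B}, \qquad \text{extra cost of one sign bit per block} = \frac{1}{B}.
$$

For an assumed block size of $B = 64$, the extra cost is $1/64 \approx 0.016$ bits per weight.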
Taxonomy
Topics: Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI) · Speech Recognition and Synthesis
