RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping

Yuzong Chen; Xilai Dai; Jake Hyun; Chi-Chih Chang; Wonsuk Jang; Yuheng Wu; Thierry Tambe; Jae-sun Seo; Mohamed S. Abdelfattah

arXiv:2501.04052·cs.LG·February 3, 2026

TL;DR

RaZeR improves NVFP4 quantization of large language models by reusing redundant bits in the encoding, which raises accuracy and reduces perplexity loss at the same memory footprint and enables more efficient low-precision inference.

Contribution

RaZeR introduces a novel zero remapping technique that exploits redundant bits in NVFP4 to achieve higher accuracy in LLM quantization without increasing memory footprint.

Findings

01. RaZeR reduces perplexity loss by over 30% compared to native NVFP4.

02. RaZeR enables more accurate 4-bit LLM inference with minimal hardware modifications.

03. Experimental results validate RaZeR's effectiveness across different quantization schemes.

Abstract

The recently introduced NVFP4 format demonstrates remarkable performance and memory benefits for quantized large language model (LLM) inference. However, we observe two types of redundancy in NVFP4 encoding: (1) The FP4 element format naturally exposes an unused quantization value due to its sign-magnitude representation that contains both positive and negative zeros. (2) The FP8 block scaling factor has an unused sign bit because it is always positive. Additionally, we find that LLM weights are more tolerant to a lower-precision block scaling factor. Based on these observations, we propose Redundant Zero Remapping (RaZeR), an enhanced numerical format that pushes the limits of NVFP4 for more accurate LLM quantization under the same memory footprint. RaZeR leverages the redundant bits of the block scaling factor to adaptively remap the redundant FP4 zero to additional quantization…
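
The redundant zero described in the abstract can be illustrated with a small Python sketch. It assumes the standard E2M1 value set for FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 with a sign bit). The candidate special values (5.0 and -5.0), the one-bit selector standing in for the unused sign bit of the block scale, and all function names are hypothetical choices for illustration; the abstract is truncated before it states which values RaZeR actually uses or how it selects them.

```python
# Illustrative sketch only; not the RaZeR implementation.

# Assumed E2M1 magnitudes, indexed by the 3 non-sign bits of an FP4 code.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
NEG_ZERO_CODE = 0b1000  # sign bit set, magnitude bits zero: the redundant "-0"


def decode_fp4(code, special_value=None):
    """Decode a 4-bit sign-magnitude code.

    If special_value is given, the redundant -0 code decodes to it instead.
    """
    if code == NEG_ZERO_CODE and special_value is not None:
        return special_value
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * FP4_MAGNITUDES[code & 0b0111]


def codebook(special_value=None):
    """Distinct values reachable by the 16 codes (+0 and -0 collapse to one)."""
    return sorted({decode_fp4(c, special_value) for c in range(16)})


def quantize_error(block, scale, special_value=None):
    """Squared error of round-to-nearest quantization of one block.

    scale is assumed positive, as stated in the abstract.
    """
    grid = codebook(special_value)
    err = 0.0
    for x in block:
        q = min(grid, key=lambda v: abs(x / scale - v))
        err += (x - q * scale) ** 2
    return err


def pick_special(block, scale, candidates=(5.0, -5.0)):
    """Choose the candidate with the lowest block error.

    The returned index is a 1-bit selector; here it stands in for the
    otherwise unused sign bit of the block scaling factor (hypothetical mapping).
    """
    errors = [quantize_error(block, scale, c) for c in candidates]
    bit = min(range(len(candidates)), key=errors.__getitem__)
    return bit, errors[bit]


block = [5.0, 1.0, -2.0]
baseline_error = quantize_error(block, scale=1.0)
selector_bit, remapped_error = pick_special(block, scale=1.0)
```

With plain FP4, the 16 codes yield only 15 distinct values, so the value 5.0 in the example block must round to a neighboring grid point. Remapping the -0 code to a per-block special value restores the sixteenth level without adding any bits.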

Code & Models

Repositories

abdelfattah-lab/razer (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training