FAAR: Format-Aware Adaptive Rounding for NVFP4

Hanglin Li; Shuchang Tian; Chen Lin; Zhiyong Zhao; Kun Zhan

arXiv:2603.22370 · cs.LG · March 25, 2026

TL;DR

FAAR introduces a learnable, format-aware rounding method for NVFP4 quantization, significantly improving LLM performance on edge devices by reducing quantization errors and aligning model parameters with the numerical grid.

Contribution

The paper proposes FAAR, an adaptive rounding strategy tailored to NVFP4, together with a two-stage fine-tuning scheme, and reports superior quantization accuracy with minimal training overhead.

Findings

01. Reduces perplexity on WikiText-2 from 14.28 to 12.60 for Llama3-1B.

02. Outperforms state-of-the-art quantization methods on various downstream tasks.

03. Requires only 4 GPU hours for fine-tuning on Llama3-1B.

Abstract

Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored for the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we…
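
To make the abstract's point about the non-uniform grid concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm. It assumes the FP4 E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) as the grid, leaves out NVFP4's block scale factors, and replaces gradient-guided learning with a brute-force search over up/down choices on a toy dot product. The helper names (`neighbors`, `round_nearest`, `round_with_choice`, `best_rounding`) are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import product

# Non-negative magnitudes representable in FP4 E2M1. The spacing grows
# from 0.5 to 1.0 to 2.0, so the grid is non-uniform.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbors(x):
    """Return the grid points bracketing |x|, clamped to the grid range."""
    mag = min(abs(x), E2M1_GRID[-1])
    hi_idx = min(bisect_right(E2M1_GRID, mag), len(E2M1_GRID) - 1)
    lo_idx = max(hi_idx - 1, 0)
    return E2M1_GRID[lo_idx], E2M1_GRID[hi_idx]


def round_nearest(x):
    """Conventional round-to-nearest on the grid, keeping the sign of x."""
    lo, hi = neighbors(x)
    mag = min(abs(x), E2M1_GRID[-1])
    q = lo if (mag - lo) <= (hi - mag) else hi
    return math.copysign(q, x)


def round_with_choice(x, up):
    """Round |x| to its upper neighbor if `up` is true, else the lower one."""
    lo, hi = neighbors(x)
    return math.copysign(hi if up else lo, x)


def output_error(weights, quantized, inputs):
    """Absolute error of the quantized dot product against the original."""
    ref = sum(w * a for w, a in zip(weights, inputs))
    got = sum(q * a for q, a in zip(quantized, inputs))
    return abs(ref - got)


def best_rounding(weights, inputs):
    """Brute-force stand-in for learned rounding: try every up/down pattern."""
    best = None
    for ups in product([False, True], repeat=len(weights)):
        q = [round_with_choice(w, up) for w, up in zip(weights, ups)]
        err = output_error(weights, q, inputs)
        if best is None or err < best[0]:
            best = (err, q)
    return best


weights = [2.4, 2.4]
inputs = [1.0, 1.0]

rtn = [round_nearest(w) for w in weights]
rtn_err = output_error(weights, rtn, inputs)
adaptive_err, adaptive = best_rounding(weights, inputs)

print("round-to-nearest:", rtn, "error:", rtn_err)
print("adaptive choice: ", adaptive, "error:", adaptive_err)
```

In this toy case both weights fall in the wide gap between 2 and 3. Round-to-nearest sends both down to 2, while rounding one of them up gives a smaller output error. A learned rounding scheme aims to find such choices at scale without enumerating them.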

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Natural Language Processing Techniques