FAAR: Format-Aware Adaptive Rounding for NVFP4
Hanglin Li, Shuchang Tian, Chen Lin, Zhiyong Zhao, Kun Zhan

TL;DR
FAAR is a learnable, format-aware rounding method for NVFP4 quantization. It reduces quantization error by aligning model parameters with the format's non-uniform numerical grid, which improves the accuracy of LLMs quantized for edge devices.
Contribution
The paper proposes FAAR, an adaptive rounding strategy tailored to NVFP4, together with a two-stage fine-tuning scheme, and reports higher quantization accuracy with minimal training overhead.
Findings
Reduces perplexity on WikiText-2 from 14.28 to 12.60 for Llama3-1B.
Outperforms state-of-the-art quantization methods on various downstream tasks.
Requires only 4 GPU hours for fine-tuning on Llama3-1B.
Abstract
Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored for the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we…
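The non-uniformity the abstract refers to can be seen in a small sketch. NVFP4 stores each element as FP4 E2M1, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so the gap between neighbouring grid points grows from 0.5 to 2. The Python below is only an illustration under stated assumptions, not the paper's implementation: the helper names (`neighbors`, `round_nearest`, `round_with_choice`, `quantize_block`) are hypothetical, the per-block scale is kept as an ordinary float rather than being stored in FP8, and the `up_mask` argument merely stands in for whatever per-weight rounding decision a learned method would produce. No learning or gradient step is shown.

```python
from bisect import bisect_right

# Non-negative magnitudes representable by FP4 E2M1 (the element type of NVFP4).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbors(x):
    """Return the grid points (lo, hi) that bracket |x|, clamped to the grid range."""
    mag = min(abs(x), E2M1_GRID[-1])
    i = bisect_right(E2M1_GRID, mag) - 1
    lo = E2M1_GRID[i]
    hi = E2M1_GRID[min(i + 1, len(E2M1_GRID) - 1)]
    return lo, hi


def round_nearest(x):
    """Conventional rounding: pick the closer neighbour (ties go down in this sketch)."""
    lo, hi = neighbors(x)
    mag = min(abs(x), E2M1_GRID[-1])
    q = lo if (mag - lo) <= (hi - mag) else hi
    return q if x >= 0 else -q


def round_with_choice(x, up):
    """Rounding with an explicit per-value decision: up=True picks hi, else lo."""
    lo, hi = neighbors(x)
    q = hi if up else lo
    return q if x >= 0 else -q


def quantize_block(block, up_mask=None):
    """Fake-quantize one block: scale so the largest magnitude maps to 6, round, rescale.

    Assumption: the scale stays a Python float here; a real NVFP4 pipeline
    would also quantize the scale itself.
    """
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    out = []
    for k, v in enumerate(block):
        s = v / scale
        if up_mask is None:
            q = round_nearest(s)
        else:
            q = round_with_choice(s, up_mask[k])
        out.append(q * scale)
    return out
```

For example, `neighbors(2.4)` brackets the value with 2.0 and 3.0, a gap of 1.0, whereas `neighbors(0.7)` brackets it with 0.5 and 1.0, a gap of 0.5. A rounding rule that treats the grid as uniform ignores this difference; passing an explicit `up_mask` shows where a per-weight decision could override nearest rounding.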
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Natural Language Processing Techniques
