Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Meng Xin; Sweta Priyadarshi; Jingyu Xin; Bilal Kartal; Aditya Vavre; Asma Kuriparambil Thekkumpate; Zijia Chen; Ameya Sunil Mahabaleshwarkar; Ido Shahaf; Akhiad Bercovich; Kinjal Patel; Suguna Varshini Velury; Chenjie Luo; Zhiyu Cheng; Jenny Chen; Chen-Han Yu; Wei Ping; Oleg Rybakov; Nima Tajbakhsh; Oluwatobi Olabiyi; Dusan Stosic; Di Wu; Song Han; Eric Chung; Sharath Turuvekere Sreenivas; Bryan Catanzaro; Yoshi Suhara; Tijmen Blankevoort; Huizi Mao

arXiv:2601.20088 · cs.LG · March 4, 2026

TL;DR

This paper presents quantization-aware distillation (QAD), a method that recovers the accuracy of NVFP4-quantized large language models and vision-language models, especially those produced by multi-stage post-training pipelines, without requiring the full training data.

Contribution

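The paper presents QAD, a distillation technique that recovers quantized-model accuracy with stable training. It is aimed particularly at models produced by multi-stage post-training pipelines (supervised fine-tuning, reinforcement learning, and model merging), where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability. The authors note that applying distillation to quantized models is not itself a new idea; the contribution lies in the observed advantages and the accompanying best practices.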

Findings

1. QAD achieves near-BF16 accuracy across multiple models.
2. It is robust to data quality and coverage issues.
3. QAD simplifies accuracy recovery in complex training pipelines.

Abstract

This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including…
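The core idea in the abstract, a full-precision teacher distilled into a quantized student through a KL divergence loss, can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption for illustration and not taken from the report: the FP4 value grid (E2M1 magnitudes), the block size of 16, the unrounded per-block scale, and the helper names `fake_quantize`, `kl_divergence`, and `linear`. The sketch only evaluates the loss on a toy layer; it does not perform gradient updates or reproduce the authors' training recipe.

```python
import math

# Assumed FP4 (E2M1) magnitude grid; the report's exact NVFP4 details are not shown here.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of values that share one scale


def fake_quantize(values, block=BLOCK):
    """Quantize-dequantize: scale each block so its max magnitude maps to 6.0,
    then snap every value to the nearest grid magnitude (sign preserved)."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Simplification: the scale is kept in full precision, not rounded to FP8.
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single output distribution."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def linear(weights, x):
    """Toy linear layer: one logit per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Teacher uses full-precision weights; the student uses the same weights after fake quantization.
teacher_w = [[0.9, -0.3, 0.2, 0.05], [-0.4, 0.7, 0.1, -0.6], [0.15, 0.25, -0.8, 0.45]]
student_w = [fake_quantize(row) for row in teacher_w]
x = [1.0, 2.0, -1.0, 0.5]

loss = kl_divergence(linear(teacher_w, x), linear(student_w, x))
print(f"toy QAD loss: {loss:.3e}")
```

In an actual training setup the student's underlying weights would be updated to reduce this loss, typically with a framework that supplies gradients through the quantization step; that part is deliberately left out of the sketch.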

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications