Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi; Chaoyi Jiang; Murali Annavaram

arXiv:2604.02556 · cs.LG · April 6, 2026

TL;DR

This paper introduces a lightweight shared memory optimization for NF4 dequantization in LLM inference, reaching a 2.0--2.2× kernel speedup over BitsAndBytes and up to 1.54× end-to-end improvement on NVIDIA GPUs.

Contribution

It presents a shared memory-based dequantization kernel that runs faster than the BitsAndBytes baseline, remains compatible with the existing ecosystem, and requires minimal engineering effort to adopt.

Findings

01. Achieved 2.0--2.2× kernel speedup over BitsAndBytes

02. Achieved up to 1.54× end-to-end performance improvement

03. Reduced shared memory usage to 64 bytes per thread block

Abstract

Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4× memory reduction, inference on current NVIDIA GPUs (e.g., Ampere A100) requires expensive dequantization back to FP16 format, creating a critical performance bottleneck. This paper presents a lightweight shared memory optimization that addresses this gap through principled memory hierarchy exploitation while maintaining full ecosystem compatibility. We compare our technique against the open-source BitsAndBytes implementation, achieving 2.0--2.2× kernel speedup across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and up to 1.54× end-to-end improvement by leveraging the 12--15× latency advantage of shared memory over global memory access. Our…
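To make the dequantization step concrete, here is a minimal CPU-side reference sketch in Python. It is not the paper's kernel. It only shows the arithmetic a GPU kernel would perform per weight: look up a 4-bit index in a 16-entry table, then scale by the block's absolute maximum. The table values are the NF4 code book as commonly used by QLoRA and BitsAndBytes. The function name `dequantize_nf4`, the high-nibble-first packing order, and the default block size of 64 are assumptions made for illustration, not details taken from this summary.

```python
# NF4 code book: 16 levels in [-1, 1]. Values assumed from the QLoRA /
# BitsAndBytes convention, not quoted from the paper summarized here.
NF4_TABLE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def dequantize_nf4(packed, absmax, block_size=64):
    """Expand packed 4-bit indices into floats.

    packed     -- bytes object, two 4-bit indices per byte
                  (high nibble first; an assumed packing order)
    absmax     -- one scale factor per block of `block_size` weights
    block_size -- weights per quantization block (assumed even)
    """
    out = []
    for i, byte in enumerate(packed):
        # Both nibbles of a byte fall in the same block when block_size is even.
        scale = absmax[(2 * i) // block_size]
        out.append(NF4_TABLE[byte >> 4] * scale)    # high nibble
        out.append(NF4_TABLE[byte & 0x0F] * scale)  # low nibble
    return out
```

In a GPU version of this loop, every thread repeats the table lookup for each weight it handles, which is why the memory level holding the 16-entry table matters. A table of 16 four-byte floats would occupy 64 bytes, which is consistent with the shared memory figure listed under Findings, although the summary does not state the table layout.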
