Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Lingchao Zheng; Yuwei Fan; Jun Li; Chengqiu Hu; Qichen Liao; Junyi Fan; Rui Shi; Fangzheng Miao

arXiv:2605.13915 · stat.ML · May 15, 2026

TL;DR

This paper introduces Multi-Scale Dequant (MSD), a quantization framework that removes the dequantization bottleneck in LLM inference by decomposing activations into multiple low-precision components, each of which is multiplied directly with quantized weights via hardware-accelerated GEMM. The authors report improved efficiency, with accuracy maintained in numerical simulations.

Contribution

MSD removes the dequantization step from the GEMM critical path by decomposing activations into multiple low-precision components, enabling more efficient LLM inference on modern hardware.
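
The idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the page does not specify the decomposition scheme, so the sketch assumes symmetric per-tensor INT8 scaling and a residual-based split, in which each pass quantizes what the previous pass left over. The function names (`quantize_int8`, `decompose_activations`, `msd_style_matmul`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (assumed scheme): x ~= scale * q."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def decompose_activations(x, passes=2):
    """Split x into `passes` INT8 components; each one quantizes the previous residual."""
    components = []
    residual = x.astype(np.float32)
    for _ in range(passes):
        q, scale = quantize_int8(residual)
        components.append((q, scale))
        residual = residual - scale * q.astype(np.float32)
    return components

def msd_style_matmul(x, w_q, w_scale, passes=2):
    """Approximate x @ (w_scale * w_q) using only integer GEMMs against w_q."""
    out = np.zeros((x.shape[0], w_q.shape[1]), dtype=np.float32)
    for q, scale in decompose_activations(x, passes):
        # INT8 x INT8 product with INT32 accumulation; the weights are never dequantized.
        acc = q.astype(np.int32) @ w_q.astype(np.int32)
        out += scale * w_scale * acc.astype(np.float32)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)            # stand-in for BF16 activations
w_q, w_scale = quantize_int8(rng.standard_normal((64, 8)).astype(np.float32))

# Reference result: the conventional path that dequantizes weights before the GEMM.
reference = x @ (w_scale * w_q.astype(np.float32))

err_one = float(np.max(np.abs(msd_style_matmul(x, w_q, w_scale, passes=1) - reference)))
err_two = float(np.max(np.abs(msd_style_matmul(x, w_q, w_scale, passes=2) - reference)))
print(f"max abs error, one pass: {err_one:.3e}; two passes: {err_two:.3e}")
```

In this sketch the second pass quantizes a residual whose magnitude is at most half of the first pass's quantization step, so the activation error shrinks sharply while every matrix product stays in integer arithmetic. The cost is one additional low-precision GEMM per extra component.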

Findings

01. MSD achieves near 16 effective bits for INT8 weights with two-pass decomposition.

02. MSD reduces KV cache HBM traffic by up to 2.5 times in attention.

03. Numerical simulations confirm MSD maintains accuracy and reduces error.

Abstract

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step, which converts low-bit weights back to high precision for matrix multiplication, has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision…
