FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

Ye Qiao; Yian Wang; Zhiheng Chen; Hyoukjun Kwon; Sitao Huang

arXiv:2605.04084·cs.LG·May 7, 2026

TL;DR

FASQ introduces a calibration-free, flexible subspace quantization method for LLM compression, enabling continuous size-quality trade-offs and real-time inference on consumer GPUs.

Contribution

FASQ provides a calibration-free product quantization framework with a continuous range of compression points and custom CUDA kernels. On Meta-Llama-3-8B it surpasses 4-bit GPTQ and AWQ in accuracy.

Findings

01

At 37-42% of the original FP16 model size, FASQ surpasses 4-bit GPTQ and AWQ in accuracy on Meta-Llama-3-8B (67.1-67.7 avg.).

02

FASQ surpasses FP16 tensor-core performance in decoding speed.

03

FASQ enables real-time inference with 2.5-5x throughput improvements.

Abstract

Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typically requires calibration data. We present FASQ (Flexible Accelerated Subspace Quantization), a calibration-free framework that applies product quantization to LLM weight matrices. By tuning two parameters, sub-vector size and codebook cardinality, FASQ exposes a continuous design space spanning 27-49% of the original FP16 model size, filling compression gaps that fixed-bit schemes cannot reach. On Meta-Llama-3-8B, FASQ surpasses 4-bit GPTQ and AWQ in accuracy (67.1-67.7 avg.) at 37-42% model size, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. To make product quantization practical at inference time, we design custom CUDA kernels: a LUT-free…
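To make the two knobs named in the abstract concrete, here is a minimal NumPy sketch of generic product quantization on a single weight matrix. It is not the authors' implementation. The function names, the plain k-means codebook fit, and the choice of one codebook per matrix are all assumptions made for illustration; the excerpt above does not specify how FASQ builds or shares codebooks. The point is only that each sub-vector of `d` weights is replaced by one index into a codebook of `k` entries, so the cost per weight is roughly log2(k)/d bits, which changes in fine steps as `d` and `k` vary.

```python
import math
import numpy as np


def pq_size_fraction(rows, cols, d, k, fp_bits=16):
    """Compressed size as a fraction of the FP16 size for one matrix.

    Assumes (hypothetically) one codebook per matrix, stored in FP16,
    and ceil(log2(k)) bits per sub-vector index.
    """
    n_sub = (rows * cols) // d                    # number of sub-vectors
    index_bits = n_sub * math.ceil(math.log2(k))  # one index per sub-vector
    codebook_bits = k * d * fp_bits               # k centroids of length d
    return (index_bits + codebook_bits) / (rows * cols * fp_bits)


def pq_quantize(W, d, k, iters=10, seed=0):
    """Fit a codebook with plain k-means and encode W as centroid indices."""
    rows, cols = W.shape
    assert cols % d == 0, "sub-vector size must divide the column count"
    X = W.reshape(-1, d)                          # row-major sub-vectors
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign(cb):
        # Squared Euclidean distance from every sub-vector to every centroid.
        dist = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        return dist.argmin(axis=1)

    for _ in range(iters):
        codes = assign(codebook)
        for j in range(k):
            members = X[codes == j]
            if len(members):                      # keep old centroid if empty
                codebook[j] = members.mean(axis=0)

    codes = assign(codebook)
    return codebook, codes.reshape(rows, cols // d)


def pq_dequantize(codebook, codes):
    """Rebuild an approximate weight matrix by looking up each index."""
    rows = codes.shape[0]
    return codebook[codes].reshape(rows, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64)).astype(np.float32)
    codebook, codes = pq_quantize(W, d=4, k=16)
    W_hat = pq_dequantize(codebook, codes)
    print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
    print("size fraction (4096x4096, d=2, k=256):",
          pq_size_fraction(4096, 4096, d=2, k=256))
```

Under these assumptions, a 4096x4096 matrix with `d=2` and `k=256` costs 8 bits per pair of weights, so `pq_size_fraction` returns about 0.25 (the codebook adds a negligible amount). Other `(d, k)` pairs land between the points that fixed 3-, 4-, or 8-bit scalar schemes offer.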
