FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
Ye Qiao, Yian Wang, Zhiheng Chen, Hyoukjun Kwon, Sitao Huang

TL;DR
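FASQ is a calibration-free compression method that applies product quantization to LLM weight matrices. Tuning sub-vector size and codebook cardinality yields a continuous size-quality trade-off (27-49% of FP16 model size), and custom CUDA kernels target real-time inference on consumer GPUs.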
Contribution
FASQ contributes a calibration-free product-quantization framework with a continuous range of compression points, plus custom CUDA kernels for inference. On Meta-Llama-3-8B it exceeds 4-bit GPTQ and AWQ in accuracy.
Findings
FASQ compresses Meta-Llama-3-8B to 37-42% of its FP16 size while scoring higher average accuracy (67.1-67.7) than 4-bit GPTQ and AWQ.
FASQ surpasses FP16 tensor-core performance in decoding speed.
FASQ enables real-time inference with 2.5-5x throughput improvements.
Abstract
Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typically requires calibration data. We present FASQ (Flexible Accelerated Subspace Quantization), a calibration-free framework that applies product quantization to LLM weight matrices. By tuning two parameters, sub-vector size and codebook cardinality, FASQ exposes a continuous design space spanning 27-49% of the original FP16 model size, filling compression gaps that fixed-bit schemes cannot reach. On Meta-Llama-3-8B, FASQ surpasses 4-bit GPTQ and AWQ in accuracy (67.1-67.7 avg.) at 37-42% model size, with consistent results on Qwen3-8B and Qwen3.5-9B-Base. To make product quantization practical at inference time, we design custom CUDA kernels: a LUT-free…
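The abstract names two tuning parameters, sub-vector size and codebook cardinality. The sketch below illustrates how these two knobs determine compressed size in generic product quantization. It is not the authors' implementation: it assumes a single shared codebook per matrix, plain k-means fitted on the weights alone, and an FP16 codebook. The function names (`pq_size_ratio`, `pq_quantize`, `pq_dequantize`) and the example values of `d` and `k` are hypothetical.

```python
import math
import numpy as np

def pq_size_ratio(rows, cols, d, k, codebook_bits=16, weight_bits=16):
    """Compressed size divided by dense size for one weight matrix.

    Assumption: one codebook of k centroids (each of length d) shared by
    the whole matrix, and ceil(log2 k) index bits per sub-vector.
    """
    n_sub = rows * cols // d
    index_bits = n_sub * math.ceil(math.log2(k))
    cb_bits = k * d * codebook_bits
    return (index_bits + cb_bits) / (rows * cols * weight_bits)

def pq_quantize(w, d, k, iters=10, seed=0):
    """Fit a codebook with k-means on the weights themselves (no calibration data)."""
    rng = np.random.default_rng(seed)
    sub = w.reshape(-1, d)  # split the matrix into length-d sub-vectors
    # Initialise centroids from k distinct sub-vectors.
    codebook = sub[rng.choice(len(sub), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every sub-vector to every centroid.
        dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(1)
        for j in range(k):
            members = sub[idx == j]
            if len(members):  # leave empty clusters unchanged
                codebook[j] = members.mean(0)
    # Final assignment against the updated centroids.
    dist = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dist.argmin(1)

def pq_dequantize(codebook, idx, shape):
    """Rebuild an approximate matrix by looking up each stored index."""
    return codebook[idx].reshape(shape)

# Small demonstration on random weights (illustrative values only).
w = np.random.default_rng(1).standard_normal((64, 32))
codebook, idx = pq_quantize(w, d=2, k=16)
w_hat = pq_dequantize(codebook, idx, w.shape)
print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
print("size ratio, 4096x4096, d=2, k=4096:", pq_size_ratio(4096, 4096, 2, 4096))
```

Under these assumptions the cost per weight is roughly `ceil(log2 k) / d` bits plus a small codebook overhead, so varying `d` and `k` together reaches ratios between the fixed 3-, 4-, and 8-bit points of scalar quantization.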
