GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Pengxiang Zhao, Xiaoming Yuan

TL;DR
GANQ introduces a GPU-adaptive non-uniform quantization method for large language models, significantly reducing memory and inference costs while maintaining performance, and enabling efficient deployment on standard GPUs.
Contribution
GANQ presents a novel, training-free, layer-wise non-uniform quantization framework optimized for GPU hardware, improving quantization accuracy and inference speed for LLMs.
Findings
Narrows the perplexity gap relative to the FP16 baseline.
Achieves up to a 2.57× speedup on an NVIDIA RTX 4090.
Effective for 3-bit and 4-bit quantization.
Abstract
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to…
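To make the idea of non-uniform, lookup-table-based weight storage concrete, here is a minimal NumPy sketch. It is not GANQ's optimization algorithm: it uses plain 1-D Lloyd (k-means) iterations per weight row as a stand-in, and it dequantizes through the table before an ordinary matrix multiply rather than using a fused mpGEMM kernel. The function names (`lloyd_codebook`, `quantize_rows`, `lut_dequantize`), the random test data, and the choice of one 16-entry codebook per row are assumptions made for illustration only.

```python
import numpy as np

def lloyd_codebook(row, bits=4, iters=20):
    """Fit a 2**bits-entry codebook to one weight row (stand-in for GANQ's solver)."""
    # Start from a uniform grid: the levels plain uniform quantization would use.
    codebook = np.linspace(row.min(), row.max(), 2 ** bits)
    for _ in range(iters):
        # Assignment step: nearest codebook entry for every weight.
        idx = np.abs(row[:, None] - codebook[None, :]).argmin(axis=1)
        # Update step: move each entry to the mean of its assigned weights.
        for k in range(codebook.size):
            members = row[idx == k]
            if members.size:  # leave empty clusters where they are
                codebook[k] = members.mean()
    idx = np.abs(row[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8)

def quantize_rows(W, bits=4, iters=20):
    """Quantize each row of W to low-bit indices plus a per-row lookup table."""
    books, codes = zip(*(lloyd_codebook(r, bits, iters) for r in W))
    return np.stack(books), np.stack(codes)

def lut_dequantize(books, codes):
    """Rebuild weights by table lookup: out[i, j] = books[i, codes[i, j]]."""
    return np.take_along_axis(books, codes.astype(np.intp), axis=1)

# Hypothetical data: a small "layer" W and a batch of inputs X.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256))
X = rng.standard_normal((256, 4))

# iters=0 keeps the uniform grid, so this is plain uniform quantization.
books_u, codes_u = quantize_rows(W, iters=0)
books_n, codes_n = quantize_rows(W, iters=20)

err_u = np.linalg.norm(W - lut_dequantize(books_u, codes_u))
err_n = np.linalg.norm(W - lut_dequantize(books_n, codes_n))
print(f"uniform weight error:     {err_u:.4f}")
print(f"non-uniform weight error: {err_n:.4f}")

# Dequantize-then-multiply, for illustration; not a fused LUT kernel.
Y = lut_dequantize(books_n, codes_n) @ X
```

Because the Lloyd iterations start from the uniform grid and each step can only lower the squared weight error, the non-uniform codebook is never worse than the uniform one on this metric. Note that this sketch minimizes the error on the weights themselves, whereas the abstract describes reducing layer-wise quantization error.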
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
