GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

Pengxiang Zhao; Xiaoming Yuan

arXiv:2501.12956 · cs.LG · June 10, 2025

TL;DR

GANQ introduces a GPU-adaptive non-uniform quantization method for large language models, significantly reducing memory and inference costs while maintaining performance, and enabling efficient deployment on standard GPUs.

Contribution

GANQ presents a novel, training-free, layer-wise non-uniform quantization framework optimized for GPU hardware, improving quantization accuracy and inference speed for LLMs.

Findings

01. Narrows the perplexity gap to the FP16 baseline.
02. Achieves up to 2.57× speedup on an NVIDIA RTX 4090.
03. Effective for both 3-bit and 4-bit quantization.

Abstract

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to…
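
As a rough illustration of the lookup-table idea described in the abstract, the sketch below stores each weight row as a small codebook plus low-bit integer codes, then computes a matrix-vector product by looking weights up in the codebook. It is not GANQ's optimization algorithm: the codebook is fitted with plain 1-D k-means (Lloyd's algorithm) as a stand-in, and the names `fit_row_codebook` and `lut_matvec` are hypothetical. Pure Python is used for readability; a real kernel would pack the codes and perform the lookups on the GPU.

```python
def _nearest(codebook, w):
    """Index of the codebook entry closest to weight w."""
    return min(range(len(codebook)), key=lambda j: (w - codebook[j]) ** 2)


def fit_row_codebook(row, bits, iters=20):
    """Fit a non-uniform codebook with 2**bits entries to one weight row.

    Assumption: plain 1-D Lloyd (k-means) iterations stand in for the
    paper's GPU-adaptive optimizer, which is not reproduced here.
    """
    k = 2 ** bits
    lo, hi = min(row), max(row)
    # Start from a uniform grid, i.e. the levels uniform quantization would use.
    codebook = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        codes = [_nearest(codebook, w) for w in row]
        for j in range(k):
            members = [w for w, c in zip(row, codes) if c == j]
            if members:  # leave unused entries where they are
                codebook[j] = sum(members) / len(members)
    codes = [_nearest(codebook, w) for w in row]
    return codebook, codes


def lut_matvec(codebooks, codes, x):
    """Compute y[i] = sum_j T_i[Q[i][j]] * x[j] via per-row table lookups."""
    return [
        sum(table[q] * xj for q, xj in zip(row_codes, x))
        for table, row_codes in zip(codebooks, codes)
    ]
```

Calling `fit_row_codebook(row, bits=4)` on each row of a weight matrix yields the tables and codes that `lut_matvec` consumes. Because the iterations start from the uniform grid and each Lloyd step cannot increase squared error, the fitted codebook is never worse than uniform rounding on that row in this sketch.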

Code & Models

Repositories

smpanaro/ganq (PyTorch)

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques