NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models
Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

TL;DR
NanoQuant introduces a novel post-training quantization method that compresses large language models to binary and sub-1-bit levels, enabling efficient deployment on consumer hardware.
Contribution
NanoQuant formulates quantization as a low-rank binary factorization problem, uses an ADMM solver to initialize the latent binary matrices and scales, and then tunes the initialized parameters through block and model reconstruction.
Findings
Compresses Llama2-70B by 25.8× in 13 hours on a single H100 GPU.
Enables large models to run on consumer 8 GB GPUs.
Establishes a new Pareto frontier in low-memory post-training quantization.
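To put the 25.8× figure in context, the following back-of-the-envelope sketch converts a compression ratio into an average number of bits per weight and an approximate memory footprint. The 16-bit baseline and the round 70e9 parameter count are assumptions made for illustration, not numbers taken from the paper, and the helper names are hypothetical.

```python
# Rough arithmetic only. ASSUMPTIONS: a 16-bit (FP16/BF16) baseline and
# exactly 70e9 parameters; the paper's own accounting may differ.
PARAMS = 70e9
BASELINE_BITS = 16
COMPRESSION_RATIO = 25.8  # reported for Llama2-70B


def effective_bits_per_weight(baseline_bits: float, ratio: float) -> float:
    """Average bits per weight implied by a whole-model compression ratio."""
    return baseline_bits / ratio


def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


bits = effective_bits_per_weight(BASELINE_BITS, COMPRESSION_RATIO)
before_gb = footprint_gb(PARAMS, BASELINE_BITS)
after_gb = footprint_gb(PARAMS, bits)

print(f"{bits:.2f} bits per weight on average")
print(f"{before_gb:.1f} GB -> {after_gb:.1f} GB")
```

Under these assumptions the ratio works out to roughly 0.62 bits per weight and about 5.4 GB of weights, which is consistent with the claim that the compressed model fits on an 8 GB GPU. The average includes any scales and unquantized tensors, so individual layers may sit above or below it.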
Abstract
Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) solver to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory…
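The idea of low-rank binary factorization can be illustrated with a minimal sketch. It assumes one plausible parameterization, W ≈ diag(scale) · (U Vᵀ) with U and V restricted to ±1, and it initializes the factors by taking signs of a truncated SVD. That initialization is a simple stand-in chosen for illustration; it is not the ADMM solver described in the abstract, and the function names, the per-row scale, and the 16-bit scale width are all assumptions.

```python
import numpy as np


def binary_lowrank_sketch(W: np.ndarray, rank: int):
    """Approximate W (m x n) as diag(scale) @ (U @ V.T) with U, V in {-1, +1}.

    ASSUMPTION: sign-of-SVD initialization plus a least-squares per-row scale.
    This is an illustrative stand-in, not the paper's ADMM procedure.
    """
    U_f, _, Vt_f = np.linalg.svd(W, full_matrices=False)
    U = np.sign(U_f[:, :rank])
    V = np.sign(Vt_f[:rank, :].T)
    U[U == 0] = 1.0  # map exact zeros to +1 so every entry is binary
    V[V == 0] = 1.0

    P = U @ V.T  # integer-valued product of the binary factors
    # Per-row least squares: scale_i = <W_i, P_i> / <P_i, P_i>
    denom = np.maximum(np.sum(P * P, axis=1), 1e-12)
    scale = np.sum(W * P, axis=1) / denom
    return U, V, scale


def bits_per_weight(m: int, n: int, rank: int, scale_bits: int = 16) -> float:
    """Storage cost: one bit per entry of U and V, plus one scale per row."""
    return (rank * (m + n) + scale_bits * m) / (m * n)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
U, V, scale = binary_lowrank_sketch(W, rank=8)
W_hat = scale[:, None] * (U @ V.T)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative error: {rel_err:.3f}")
print(f"bits per weight at 4096 x 4096, rank 1024: {bits_per_weight(4096, 4096, 1024):.3f}")
```

The storage formula shows why the factorization can go below one bit: the binary factors cost rank · (m + n) bits instead of m · n, so any rank smaller than m · n / (m + n) yields a sub-1-bit representation before the scales are counted. A naive initialization like this one generally leaves a large reconstruction error, which is the gap a more careful solver and subsequent reconstruction tuning are meant to close.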
Taxonomy
Topics: Advanced Data Compression Techniques · Advanced Neural Network Applications · Speech Recognition and Synthesis
