QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Elias Frantar, Dan Alistarh

TL;DR
QMoE is a compression framework that reduces trillion-parameter Mixture-of-Experts models to under 1 bit per parameter, enabling affordable inference on commodity hardware with minimal accuracy loss.
Contribution
The paper presents a scalable compression algorithm and custom GPU decoding kernels for sub-1-bit compression of trillion-parameter MoE language models, achieving 20x compression while keeping end-to-end inference practical.
Findings
Compresses 1.6 trillion parameters to 0.8 bits per parameter
Enables trillion-parameter model inference on standard GPUs
Maintains high accuracy with minimal runtime overhead
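The headline numbers can be checked with a few lines of arithmetic. The sketch below assumes the uncompressed baseline stores each parameter in 16 bits; that baseline width is an assumption made here for illustration, and the names `PARAMS`, `BASELINE_BITS`, `COMPRESSED_BITS`, and `size_bytes` are hypothetical.

```python
# Back-of-the-envelope footprint check for the figures quoted above.
PARAMS = 1.6e12         # parameter count from the findings
BASELINE_BITS = 16      # ASSUMPTION: 16-bit weights for the uncompressed model
COMPRESSED_BITS = 0.8   # bits per parameter from the findings


def size_bytes(n_params: float, bits_per_param: float) -> float:
    """Storage needed for n_params values at the given bit width."""
    return n_params * bits_per_param / 8


baseline_tb = size_bytes(PARAMS, BASELINE_BITS) / 1e12
compressed_gb = size_bytes(PARAMS, COMPRESSED_BITS) / 1e9
ratio = BASELINE_BITS / COMPRESSED_BITS

print(f"baseline:   {baseline_tb:.1f} TB")
print(f"compressed: {compressed_gb:.0f} GB")
print(f"ratio:      {ratio:.0f}x")
```

Under that assumption the baseline comes to about 3.2 TB, the compressed weights to about 160 GB, and the ratio to 20x, which matches the compression factor stated under Contribution.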
Abstract
Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can…
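A rate below 1 bit per parameter can look paradoxical, since even a binary weight costs 1 bit when stored naively. The abstract does not describe the custom format, so the sketch below is not the QMoE encoding. It only illustrates the general information-theoretic point: when quantized values are heavily skewed toward one symbol, their Shannon entropy, which is the lower bound for lossless coding, can fall below 1 bit per value. The ternary alphabet and the zero fractions are assumptions chosen for illustration, and `entropy_bits` is a hypothetical helper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits per symbol, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# ASSUMPTION: weights take values in {-1, 0, +1}, with the two nonzero
# values equally likely. The zero fractions below are made up.
for p_zero in (0.5, 0.8, 0.9):
    p_nonzero = (1 - p_zero) / 2
    h = entropy_bits([p_nonzero, p_zero, p_nonzero])
    print(f"zero fraction {p_zero:.0%}: {h:.2f} bits per weight")
```

With half the weights at zero the bound is 1.5 bits per weight; as the zero fraction grows, the bound drops below 1 bit, which is the regime where a sub-1-bit storage format becomes possible in principle.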
Taxonomy
Topics: Machine Learning and Algorithms · Topic Modeling · Speech Recognition and Synthesis
