QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

Elias Frantar; Dan Alistarh

arXiv:2310.16795 · cs.LG · October 26, 2023 · 6 cites

TL;DR

QMoE is a compression and execution framework that reduces trillion-parameter Mixture-of-Experts (MoE) models to less than 1 bit per parameter, enabling affordable inference on commodity hardware with minor accuracy loss.

Contribution

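The paper presents a scalable compression algorithm and custom GPU decoding kernels for sub-1-bit compression of MoE language models, achieving roughly 20x compression relative to the 3.2TB uncompressed footprint of SwitchTransformer-c2048 while keeping end-to-end inference practical.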

Findings

01. Compresses the 1.6-trillion-parameter SwitchTransformer-c2048 to about 0.8 bits per parameter
02. Enables trillion-parameter model inference on commodity GPUs
03. Keeps accuracy loss and runtime overhead minor relative to uncompressed execution
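
An average below 1 bit per parameter can look impossible at first, since a single stored value normally needs at least one bit. It becomes feasible when the quantized values are heavily skewed toward one symbol, because their Shannon entropy then falls below 1 bit. The sketch below illustrates this with a hypothetical three-valued distribution; the probabilities are illustrative assumptions, not figures reported on this page, and `entropy_bits` is a helper name introduced here.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits per symbol, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# ASSUMPTION: hypothetical probabilities for three quantized values (-1, 0, +1),
# with most of the mass on zero. These are not numbers taken from the paper page.
skewed_probs = [0.05, 0.90, 0.05]
uniform_probs = [1 / 3, 1 / 3, 1 / 3]

skewed_bits = entropy_bits(skewed_probs)    # below 1 bit per value
uniform_bits = entropy_bits(uniform_probs)  # log2(3), about 1.58 bits per value

print(f"skewed:  {skewed_bits:.3f} bits per value")
print(f"uniform: {uniform_bits:.3f} bits per value")
```

Entropy is only a lower bound on the average code length. A practical format also has to be decodable quickly on a GPU, which is the role of the custom decoding kernels mentioned in the contribution.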

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm that accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can…
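
The memory figures quoted on this page can be checked with simple arithmetic. The sketch below uses the 1.6 trillion parameter count and the 0.8 bits per parameter stated above; the 16-bit uncompressed width is an assumption, chosen because it reproduces the quoted 3.2TB. The function name `footprint_tb` is introduced here for illustration.

```python
PARAMS = 1.6e12          # SwitchTransformer-c2048 parameter count (from the abstract)
BITS_UNCOMPRESSED = 16   # ASSUMPTION: 16-bit weights, consistent with the quoted 3.2TB
BITS_COMPRESSED = 0.8    # bits per parameter listed in the findings

def footprint_tb(num_params: float, bits_per_param: float) -> float:
    """Storage in terabytes, using 1 TB = 1e12 bytes."""
    return num_params * bits_per_param / 8 / 1e12

uncompressed_tb = footprint_tb(PARAMS, BITS_UNCOMPRESSED)
compressed_tb = footprint_tb(PARAMS, BITS_COMPRESSED)
ratio = uncompressed_tb / compressed_tb

print(f"uncompressed: {uncompressed_tb:.2f} TB")  # 3.20 TB
print(f"compressed:   {compressed_tb:.2f} TB")    # 0.16 TB
print(f"ratio:        {ratio:.0f}x")              # 20x
```

Under these assumptions the compressed weights occupy about 0.16TB, which matches the roughly 20x compression stated in the contribution.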

Code & Models

Repositories

ist-daslab/qmoe (PyTorch, official)

Taxonomy

Topics: Machine Learning and Algorithms · Topic Modeling · Speech Recognition and Synthesis