TL;DR
This paper introduces a statistically-lossless quantization method for large language models that balances fidelity and efficiency: it reaches task-lossless quality below 4 bits per parameter and reports inference speedups of 1.7 to 3.6 times over FP16.
Contribution
It formalizes notions of task-lossless and distribution-lossless compression, proposes the EAR metric, and develops SLQ, a novel asymmetric quantization technique with wide bitwidth search.
Findings
Task-lossless compression achieved below 4 bits per parameter.
Distribution-lossless compression achieved at 5-6 bits per parameter.
Inference speedups of 1.7 to 3.6 times over FP16.
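To put the bits-per-parameter figures in perspective, the sketch below converts an average bitwidth into a weight-storage footprint and a compression ratio relative to FP16. It is simple arithmetic, not the paper's measurement procedure: the function names and the 7-billion-parameter model size are hypothetical, and it ignores activations, the KV cache, and any quantization metadata stored outside the weights.

```python
def weight_footprint_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GiB at a given average bitwidth."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


def compression_ratio(baseline_bits: float, quantized_bits: float) -> float:
    """How many times smaller the weights are than at the baseline bitwidth."""
    return baseline_bits / quantized_bits


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical model size, not taken from the paper
    for bits in (16, 6, 5, 4):
        size = weight_footprint_gib(num_params, bits)
        ratio = compression_ratio(16, bits)
        print(f"{bits:>2} bits/param: {size:6.2f} GiB ({ratio:.2f}x smaller than FP16)")
```

A smaller footprint does not by itself imply the reported speedups, which also depend on the kernels and hardware used.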
Abstract
Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal…
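The abstract is cut off before the full EAR definition, so the following is only a sketch of one standard quantity that fits the description "maximum token-agreement probability": for two next-token distributions p and q, the largest probability that a token drawn from p equals a token drawn from q, over all joint samplings, is the sum of min(p_i, q_i), which equals one minus the total variation distance. Averaging this over contexts is an assumption here, not the paper's confirmed definition, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Distribution = Sequence[float]


def agreement_probability(p: Distribution, q: Distribution) -> float:
    """Maximum probability that samples from p and q coincide.

    Equals sum_i min(p_i, q_i) = 1 - TV(p, q), attained by a maximal coupling.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same vocabulary size")
    return sum(min(a, b) for a, b in zip(p, q))


def expected_agreement(pairs: Sequence[Tuple[Distribution, Distribution]]) -> float:
    """Average the per-context agreement probability (assumed EAR-style metric)."""
    if not pairs:
        raise ValueError("need at least one pair of distributions")
    rates = [agreement_probability(p, q) for p, q in pairs]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    original = [0.5, 0.3, 0.2]   # toy next-token distribution of the original model
    quantized = [0.4, 0.4, 0.2]  # toy distribution of a quantized model
    print(agreement_probability(original, quantized))
    print(expected_agreement([(original, quantized), (original, original)]))
```

A value near 1 means the two models' next-token distributions are nearly indistinguishable at that context; identical distributions give exactly the total probability mass.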
