TL;DR
MixLLM introduces a global mixed-precision quantization method for LLMs, optimizing accuracy and efficiency by selectively assigning bit-widths to output features based on their importance.
Contribution
It proposes a novel global mixed-precision quantization approach that improves accuracy and system efficiency for large language models.
Findings
Reduces the perplexity increase caused by quantization from 0.5 to 0.2 on Llama 3.1 70B while using only 10% more bits.
Reduces the MMLU-Pro loss from 1.92 to 0.99 compared with the state of the art.
Achieves state-of-the-art system efficiency in LLM quantization.
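To make the "10% more bits" figure concrete, the following is a small arithmetic sketch rather than anything taken from the paper. It assumes a 4-bit baseline in which a fraction of output features is promoted to 8 bits; the summary above does not state these bit-widths, and the function names are hypothetical. Metadata such as scales and zero points is ignored.

```python
def average_bits(high_fraction, high_bits=8, low_bits=4):
    """Average weight bit-width when `high_fraction` of the output
    features use `high_bits` and the rest use `low_bits`.

    The 8-bit / 4-bit defaults are an assumption made for illustration.
    """
    if not 0.0 <= high_fraction <= 1.0:
        raise ValueError("high_fraction must lie in [0, 1]")
    return high_fraction * high_bits + (1.0 - high_fraction) * low_bits


def high_fraction_for_budget(extra_fraction, high_bits=8, low_bits=4):
    """Fraction of output features that can be promoted to `high_bits`
    when the average may exceed `low_bits` by `extra_fraction`
    (for example 0.10 for 10% more bits)."""
    target = low_bits * (1.0 + extra_fraction)
    return (target - low_bits) / (high_bits - low_bits)


if __name__ == "__main__":
    print(average_bits(0.10))              # average bits for a 10% / 90% split
    print(high_fraction_for_budget(0.10))  # share of features promoted to 8 bits
```

Under these assumptions, promoting one output feature in ten from 4 to 8 bits raises the average to roughly 4.4 bits per weight, which is 10% above the 4-bit baseline.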
Abstract
Quantization has become one of the most effective ways to compress LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features from a global view rather than within each single layer, assigning a larger bit-width to the output features that need it most in order to achieve high accuracy with low memory usage. We present the sweet spot of quantization configuration under algorithm-system co-design, with both high accuracy and high system efficiency. To address the system challenge, we design a two-step dequantization to make use of the Tensor Core…
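The difference between a global view and a per-layer view can be illustrated with a short NumPy sketch. This is not the authors' implementation: the per-feature salience scores are a hypothetical placeholder for whatever importance metric is used, and the function name, the 8-bit / 4-bit widths, and the 10% fraction are assumptions chosen for illustration.

```python
import numpy as np


def assign_bits_globally(salience_per_layer, high_fraction=0.1,
                         high_bits=8, low_bits=4):
    """Give `high_bits` to the top `high_fraction` of output features,
    ranked across ALL layers at once rather than inside each layer.

    salience_per_layer: list of 1-D arrays, one hypothetical importance
    score per output feature of each layer.
    Returns a list of integer arrays with the bit-width of each feature.
    """
    sizes = [len(s) for s in salience_per_layer]
    all_scores = np.concatenate(salience_per_layer)

    # Number of features that receive the larger bit-width.
    k = int(round(high_fraction * all_scores.size))

    bits = np.full(all_scores.size, low_bits, dtype=np.int64)
    if k > 0:
        # Indices of the k largest scores over the whole model.
        top = np.argpartition(all_scores, -k)[-k:]
        bits[top] = high_bits

    # Split the flat assignment back into one array per layer.
    return np.split(bits, np.cumsum(sizes)[:-1])


if __name__ == "__main__":
    layers = [
        np.array([0.1, 0.2, 0.3, 0.4, 0.5]),  # layer with low scores
        np.array([5.0, 6.0, 7.0, 8.0, 0.0]),  # layer with high scores
    ]
    for i, b in enumerate(assign_bits_globally(layers, high_fraction=0.2)):
        print(f"layer {i}: {b.tolist()}")
```

With a global ranking, a layer whose features all score low may receive no high-precision features, while a layer with many high-scoring features receives several. A per-layer top-k would instead force the same share onto every layer.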
