MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Zhen Zheng; Xiaonan Song; Chuanjie Liu

arXiv:2412.14590 · cs.LG · April 23, 2026

TL;DR

MixLLM introduces a global mixed-precision quantization method for LLMs, optimizing accuracy and efficiency by selectively assigning bit-widths to output features based on their importance.

Contribution

It proposes global mixed-precision quantization between output features, ranking feature importance across the whole model rather than within each layer, together with a system design that keeps the quantized model efficient to run.

Findings

01

Reduces perplexity increase from 0.5 to 0.2 on Llama 3.1 70B with only 10% more bits.

02

Reduces the MMLU-Pro loss from 1.92 to 0.99 compared with the prior state of the art.

03

Achieves state-of-the-art system efficiency in LLM quantization.

Abstract

Quantization has become one of the most effective ways to compress LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features from a global view rather than within each single layer, assigning a larger bit-width to the output features that need it most to achieve high accuracy with low memory usage. We present the sweet spot of quantization configuration for algorithm-system co-design, with high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization to make use of the Tensor Core…
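The global selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the layer names, the salience scores, and the 4-bit/8-bit split with a 10% high-precision fraction are all assumptions chosen for illustration, and how salience is actually estimated is left out. The sketch only shows the difference between ranking output features across the whole model and ranking them per layer, plus the arithmetic behind "10% more bits".

```python
def assign_bitwidths_globally(salience_per_layer, high_fraction=0.1,
                              low_bits=4, high_bits=8):
    """Give `high_bits` to the top `high_fraction` of output features,
    ranked across ALL layers at once (hypothetical helper).

    salience_per_layer: dict of layer name -> list of per-output-feature
    scores. The scores are assumed inputs; computing them is out of scope.
    Returns a dict of layer name -> list of bit-widths, one per feature.
    """
    # Flatten to (score, layer, index) so the ranking is global, not per layer.
    ranked = sorted(
        ((score, name, idx)
         for name, scores in salience_per_layer.items()
         for idx, score in enumerate(scores)),
        key=lambda item: item[0],
        reverse=True,
    )
    n_high = int(len(ranked) * high_fraction)

    # Start with every feature at the low bit-width, then upgrade the winners.
    bits = {name: [low_bits] * len(scores)
            for name, scores in salience_per_layer.items()}
    for _, name, idx in ranked[:n_high]:
        bits[name][idx] = high_bits
    return bits


def average_bits(bits):
    """Mean bit-width per output feature, assuming every output feature
    holds the same number of weights."""
    flat = [b for layer in bits.values() for b in layer]
    return sum(flat) / len(flat)


# Toy salience scores (made up): layer1 is far more sensitive than layer0.
salience = {
    "layer0": [i / 100 for i in range(100)],  # scores 0.00 .. 0.99
    "layer1": [i / 10 for i in range(100)],   # scores 0.0 .. 9.9
}
bits = assign_bitwidths_globally(salience)

high_per_layer = {name: sum(b == 8 for b in layer)
                  for name, layer in bits.items()}
print(high_per_layer)      # all 20 high-precision slots go to layer1
print(average_bits(bits))  # 0.9 * 4 + 0.1 * 8 = 4.4 bits, 10% above 4
```

A per-layer scheme would instead give each layer its own 10% of 8-bit features regardless of how sensitive that layer is. In this toy setup the global ranking spends the entire high-precision budget on the more sensitive layer while the average stays at 4.4 bits.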

Code & Models

Repositories

microsoft/MixLLM (GitHub)
