GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

Zhangyang Yao; Haiyan Zhao; Haoyu Wang; Tianbo Huang; Lihua Zhang; Xu Han

arXiv:2605.18475·cs.LG·May 19, 2026

TL;DR

GAMMA is a post-training, quantizer-agnostic framework that allocates mixed-precision bit-widths across the modules of a large language model under a given budget, reaching fixed 3-bit quality at a 2.5-bit average precision.

Contribution

GAMMA introduces a post-training method that learns module sensitivities and solves the bit allocation with integer programming, avoiding costly retraining and static proxy metrics.
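To make the allocation step concrete, here is a minimal sketch of choosing one bit-width per module under a hard budget. It is not the paper's solver: it uses a small dynamic program over a multiple-choice knapsack as a stand-in for the integer program, and the function name `allocate_bits`, the module sizes, and the per-module cost table are all hypothetical values chosen for illustration.

```python
import math
from typing import List, Sequence


def allocate_bits(
    sizes: Sequence[int],
    costs: Sequence[Sequence[float]],
    bit_choices: Sequence[int],
    avg_bits_budget: float,
) -> List[int]:
    """Pick one bit-width per module, minimizing total cost subject to
    sum(size_i * bits_i) <= avg_bits_budget * sum(sizes).

    costs[i][k] is a hypothetical sensitivity score for module i when it is
    quantized to bit_choices[k] (lower is better).
    """
    budget = math.floor(avg_bits_budget * sum(sizes) + 1e-9)
    # best maps "total bits used so far" -> (lowest cost, chosen bit-widths)
    best = {0: (0.0, [])}
    for size, cost_row in zip(sizes, costs):
        nxt = {}
        for used, (cost, picks) in best.items():
            for bits, extra in zip(bit_choices, cost_row):
                total = used + size * bits
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (cost + extra, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("no allocation fits the budget")
    return min(best.values(), key=lambda entry: entry[0])[1]


# Illustrative inputs only: three modules, sizes in arbitrary units.
example_sizes = [1, 1, 2]
example_bits = [2, 3, 4]
example_costs = [
    [5.0, 1.0, 0.5],  # module 0: very sensitive at 2 bits
    [1.0, 0.8, 0.7],  # module 1: barely sensitive
    [4.0, 1.5, 1.0],  # module 2: large and moderately sensitive
]
example_allocation = allocate_bits(example_sizes, example_costs, example_bits, 3.0)
print(example_allocation)
```

With these made-up scores, the sensitive module receives 4 bits and the insensitive one drops to 2, while the size-weighted average stays at the 3-bit budget. The table-based approach only scales to small budgets; a real integer-programming solver would be the natural replacement at LLM scale.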

Findings

1. GAMMA outperforms fixed-precision baselines by up to +12.99 in average accuracy.

2. It surpasses search-based mixed-precision methods by up to +7.00 in average accuracy.

3. GAMMA achieves fixed 3-bit quality at 2.5-bit average precision.
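As a rough illustration of what a 2.5-bit average means for weight storage, the sketch below computes a size-weighted average bit-width and the resulting relative saving against a fixed 3-bit model. The helper names and module sizes are hypothetical, and the arithmetic ignores quantizer overhead such as scales and zero points, so real savings would differ somewhat.

```python
from typing import Sequence


def average_bits(sizes: Sequence[int], bits: Sequence[int]) -> float:
    """Size-weighted average bit-width of a mixed-precision allocation."""
    return sum(s * b for s, b in zip(sizes, bits)) / sum(sizes)


def weight_bytes(num_params: int, avg_bits: float) -> float:
    """Weight storage in bytes, ignoring scales, zero points, and padding."""
    return num_params * avg_bits / 8.0


# Hypothetical allocation: the small middle module keeps 4 bits.
mixed_avg = average_bits([1, 1, 2], [2, 4, 2])

# Relative weight-memory saving of a 2.5-bit average versus fixed 3-bit.
relative_saving = 1.0 - weight_bytes(1000, 2.5) / weight_bytes(1000, 3.0)
print(mixed_avg, round(relative_saving, 4))
```

Under these simplifications, moving from 3 bits to a 2.5-bit average reduces weight memory by about one sixth.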

Abstract

Mixed-precision quantization improves the budget-accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete…
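The abstract is truncated, so the exact objective is not available here. As a generic illustration of how an augmented Lagrangian enforces a budget constraint, the toy sketch below minimizes a hypothetical quadratic sensitivity loss over continuous per-module bit-widths while driving the average bit-width to a target. The loss, the equal module sizes, the 4-bit ceiling, and every hyperparameter are assumptions; the paper's teacher-forced hidden-state reconstruction objective is replaced by this stand-in.

```python
from typing import List, Sequence, Tuple


def solve_relaxed_allocation(
    sens: Sequence[float],
    budget_avg: float,
    rho: float = 10.0,
    lr: float = 0.05,
    outer: int = 20,
    inner: int = 200,
) -> Tuple[List[float], float]:
    """Minimize sum_i sens[i] * (hi - x_i)^2 subject to mean(x) == budget_avg.

    x_i is a relaxed (continuous) bit-width for module i; sens[i] is a
    hypothetical sensitivity weight. Modules are assumed to be equally sized.
    """
    n = len(sens)
    hi = 4.0                  # assumed highest available bit-width
    x = [budget_avg] * n      # start from a feasible uniform allocation
    lam = 0.0                 # Lagrange multiplier for the budget constraint
    for _ in range(outer):
        # Inner loop: gradient descent on the augmented Lagrangian
        #   L(x) = f(x) + lam * g(x) + (rho / 2) * g(x)^2
        for _ in range(inner):
            g = sum(x) / n - budget_avg
            shared = (lam + rho * g) / n  # constraint term, same for all i
            x = [
                xi - lr * (-2.0 * a * (hi - xi) + shared)
                for xi, a in zip(x, sens)
            ]
        # Outer loop: multiplier update using the remaining violation
        g = sum(x) / n - budget_avg
        lam += rho * g
    return x, lam


relaxed_bits, multiplier = solve_relaxed_allocation([4.0, 1.0], 3.0)
print([round(b, 3) for b in relaxed_bits])
```

In this toy setting the more sensitive module settles above the 3-bit average and the less sensitive one below it, with the mean held at the target. The continuous values would still need to be projected onto discrete bit-widths, which is the step the abstract describes as producing exact budget-feasible allocations.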
