FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

Fangxin Liu; Zongwu Wang; JinHong Xia; Junping Zhao; Shouren Zhao; Jinjin Li; Jian Liu; Li Jiang; Haibing Guan

arXiv:2506.12024 · cs.LG · October 22, 2025

TL;DR

FlexQuant is a dynamic, layer-wise mixed-precision quantization framework for LLMs that switches precision adaptively during inference, achieving a 1.3x inference speedup with negligible accuracy loss.

Contribution

FlexQuant introduces a dynamic precision-switching framework for LLM quantization that adjusts layer-wise bit-widths at each token generation step, guided by model perplexity entropy and Kullback-Leibler divergence.

Findings

01. Achieves 1.3x speedup in inference

02. Maintains negligible accuracy loss

03. Provides a comprehensive analysis of quantization strategies

Abstract

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained…
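The abstract names two signals, entropy of the model's predictive distribution and Kullback-Leibler divergence, as the basis for choosing a bit-width. The sketch below is only an illustration of how such signals could gate a precision choice; it is not the authors' algorithm. The function `choose_bits`, the thresholds `entropy_max` and `kl_max`, the 4-bit/8-bit options, and the rule "use low precision only when the prediction is confident and the low-precision output stays close to the reference" are all assumptions made for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; q is floored at eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def choose_bits(ref_logits, low_logits,
                entropy_max=1.0, kl_max=0.05,
                low_bits=4, high_bits=8):
    # Hypothetical decision rule (an assumption, not taken from the paper):
    # pick the low bit-width only if the reference prediction is confident
    # (low entropy) AND the low-precision distribution stays close to it
    # (low KL divergence); otherwise fall back to the high bit-width.
    p = softmax(ref_logits)   # reference (higher-precision) distribution
    q = softmax(low_logits)   # candidate low-precision distribution
    if entropy(p) <= entropy_max and kl_divergence(p, q) <= kl_max:
        return low_bits
    return high_bits

# Confident prediction, low-precision logits nearly identical.
print(choose_bits([4.0, 0.0, 0.0], [3.9, 0.1, 0.0]))
# Uniform (uncertain) prediction.
print(choose_bits([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
# Low-precision logits disagree strongly with the reference.
print(choose_bits([4.0, 0.0, 0.0], [0.0, 4.0, 0.0]))
```

In practice both distributions would not be available for every token at inference time, so a rule like this is better read as an offline calibration step that measures how sensitive a layer or token position is to reduced precision.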

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Neural Networks and Applications