MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs

Dongwei Wang; Jinhee Kim; Seokho Han; Denis Gudovskiy; Yohei Nakata; Tomoyuki Okuno; KhayTze Peong; Kang Eun Jeon; Jong Hwan Ko; Yiran Chen; Huanrui Yang

arXiv:2602.20191·cs.LG·February 25, 2026

TL;DR

MoBiQuant is a Mixture-of-Bits quantization method that adjusts weight precision per token according to token sensitivity, so a single LLM can switch precision during inference without being recalibrated for each bit-width.

Contribution

The paper proposes a Mixture-of-Bits quantization framework that dynamically adjusts weight precision based on token sensitivity for elastic LLM deployment.

Findings

1. Matches performance of bit-specific PTQ on LLaMA3-8B
2. Enables smooth precision switching during inference
3. Improves generalization for token outliers

Abstract

Changing runtime complexity on cloud and edge devices necessitates elastic large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. However, it has been observed that the calibration parameters for quantization are typically linked to specific precisions, which presents challenges during elastic-precision calibration and precision switching at runtime. In this work, we attribute the source of varying calibration parameters to the varying token-level sensitivity caused by a precision-dependent outlier migration phenomenon. Motivated by this observation, we propose MoBiQuant, a novel Mixture-of-Bits quantization framework that adjusts weight precision for elastic LLM inference based on token sensitivity. Specifically, we propose the many-in-one recursive residual quantization that can…
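
To illustrate the general idea of recursive residual quantization named in the abstract, here is a minimal Python sketch. It is not the paper's method: the abstract is truncated above, so the symmetric round-to-nearest quantizer, the 2-bit slice widths, the sample weights, and the function names (quantize, residual_slices, reconstruct) are all assumptions made for this example. Each stage quantizes whatever error the previous stages left behind, so one set of stored slices can serve several precisions.

```python
def quantize(values, bits):
    """Symmetric uniform round-to-nearest quantizer (an assumed, generic choice).

    Returns integer codes in [-qmax, qmax] and the scale used.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def residual_slices(weights, bit_widths):
    """Quantize the weights, then repeatedly quantize the remaining residual."""
    slices = []
    residual = list(weights)
    for bits in bit_widths:
        codes, scale = quantize(residual, bits)
        slices.append((codes, scale))
        # The next stage only sees the error left by this stage.
        residual = [r - c * scale for r, c in zip(residual, codes)]
    return slices


def reconstruct(slices, num_slices):
    """Sum the first num_slices dequantized slices."""
    out = [0.0] * len(slices[0][0])
    for codes, scale in slices[:num_slices]:
        out = [o + c * scale for o, c in zip(out, codes)]
    return out


def max_abs_error(weights, approx):
    return max(abs(w - a) for w, a in zip(weights, approx))


# Hypothetical weights; three 2-bit slices give three precision levels.
weights = [0.82, -0.41, 0.05, -1.30, 0.67, 0.19, -0.93, 1.12]
slices = residual_slices(weights, bit_widths=[2, 2, 2])
errors = [max_abs_error(weights, reconstruct(slices, k)) for k in (1, 2, 3)]
print(errors)  # the error shrinks as more slices are summed
```

Under these assumptions, a runtime could keep every slice in memory and decide per token how many to sum. How MoBiQuant actually makes that decision is not described in the excerpt above.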

Taxonomy

Topics: Software-Defined Networks and 5G · Natural Language Processing Techniques · Network Packet Processing and Optimization