Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Kexin Chu; Dawei Xiang; Zixu Shen; Yiwei Yang; Zecheng Liu; Wei Zhang

arXiv:2511.15015 · cs.PF · February 9, 2026

Open Access

TL;DR

DynaExq is a runtime-aware mixed-precision system for single-GPU MoE inference that dynamically allocates expert precision to optimize memory usage and throughput, adapting to workload hotness.

Contribution

DynaExq introduces an online, budget-constrained precision allocation method that improves memory efficiency and inference speed for MoE models on memory-limited GPUs.
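To make "budget-constrained precision allocation" concrete, here is a minimal sketch of one possible policy. It is not the paper's algorithm: the greedy rule, the two precision tiers, and every name in it (`Expert`, `allocate_precision`, `hotness`, `size_low`, `size_high`) are assumptions made for illustration. Every expert starts at low precision, and the hottest experts are promoted to high precision for as long as the memory budget allows.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    # All fields are hypothetical; the source does not define these quantities.
    name: str
    hotness: float   # e.g. a recent activation count
    size_low: int    # memory footprint at low precision (arbitrary units)
    size_high: int   # memory footprint at high precision (same units)


def allocate_precision(experts, budget):
    """Greedy sketch: start everything at low precision, then promote
    experts in descending hotness order while the budget permits."""
    base = sum(e.size_low for e in experts)
    if base > budget:
        raise ValueError("budget is too small even at low precision")
    remaining = budget - base
    plan = {e.name: "low" for e in experts}
    for e in sorted(experts, key=lambda e: e.hotness, reverse=True):
        extra = e.size_high - e.size_low  # cost of promoting this expert
        if extra <= remaining:
            plan[e.name] = "high"
            remaining -= extra
    return plan


experts = [
    Expert("e0", hotness=900, size_low=4, size_high=8),
    Expert("e1", hotness=50, size_low=4, size_high=8),
    Expert("e2", hotness=400, size_low=4, size_high=8),
    Expert("e3", hotness=5, size_low=4, size_high=8),
]
plan = allocate_precision(experts, budget=24)
print(plan)
```

With a budget of 24 units, the two hottest experts (`e0` and `e2`) are promoted and the rest stay at low precision. An online system would re-run a policy like this as hotness estimates change, which this sketch does not model.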

Findings

01. Achieves up to 2.73x higher throughput than baselines.

02. Improves accuracy from 73.09% to 77.57% on Qwen3-80B.

03. Effectively manages expert hotness for dynamic precision adjustment.

Abstract

Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key insight is to keep…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques