Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang

TL;DR
DynaExq is a runtime-aware mixed-precision system for single-GPU MoE inference. It assigns each expert's precision at runtime, adapting to which experts are hot under the current workload, to balance memory usage against throughput.
Contribution
DynaExq introduces an online, budget-constrained method for allocating expert precision, which improves memory efficiency and inference speed for MoE models on memory-limited GPUs.
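To make "budget-constrained precision allocation" concrete, here is a minimal sketch and not the paper's algorithm. It assumes a simple greedy policy: every expert starts at a low bit-width, and the hottest experts are upgraded to a high bit-width while the memory budget allows. The names `Expert`, `allocate_precision`, `hotness`, and the 4-bit and 16-bit levels are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    expert_id: int
    hotness: float   # hypothetical score, e.g. recent routing frequency
    num_params: int  # number of weights in this expert


def allocate_precision(experts, budget_bytes, low_bits=4, high_bits=16):
    """Greedy sketch: start everyone at low_bits, upgrade hottest first.

    Returns (plan, used_bytes) where plan maps expert_id -> bit-width.
    This is an illustrative policy, not the method from the paper.
    """
    def size_bytes(expert, bits):
        return expert.num_params * bits // 8

    plan = {e.expert_id: low_bits for e in experts}
    used = sum(size_bytes(e, low_bits) for e in experts)
    if used > budget_bytes:
        raise ValueError("budget is too small even at the lowest precision")

    # Visit experts from hottest to coldest; upgrade when the extra bytes fit.
    for e in sorted(experts, key=lambda e: e.hotness, reverse=True):
        extra = size_bytes(e, high_bits) - size_bytes(e, low_bits)
        if used + extra <= budget_bytes:
            plan[e.expert_id] = high_bits
            used += extra
    return plan, used


if __name__ == "__main__":
    experts = [
        Expert(0, 0.50, 1000),
        Expert(1, 0.30, 1000),
        Expert(2, 0.15, 1000),
        Expert(3, 0.05, 1000),
    ]
    plan, used = allocate_precision(experts, budget_bytes=5200)
    print(plan, used)
```

Because the budget is checked before every upgrade, the plan never exceeds the memory envelope. A real system would also have to account for the cost of switching an expert's precision while requests are in flight, which this sketch ignores.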
Findings
Achieves up to 2.73x higher throughput than baselines.
Improves accuracy from 73.09% to 77.57% on Qwen3-80B.
Tracks expert hotness at runtime and adjusts per-expert precision as the hot set shifts.
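One plausible way to track expert hotness, shown here only as an assumption since this summary does not describe the actual mechanism, is an exponential moving average over per-batch routing frequencies. The `HotnessTracker` class, its `decay` parameter, and the `hot_set` helper are hypothetical names.

```python
class HotnessTracker:
    """Illustrative EMA-based hotness estimate (an assumption, not the paper's design)."""

    def __init__(self, num_experts, decay=0.9):
        self.decay = decay
        self.scores = [0.0] * num_experts

    def update(self, routed_expert_ids):
        """Fold one batch of routing decisions into the running scores."""
        counts = [0] * len(self.scores)
        for expert_id in routed_expert_ids:
            counts[expert_id] += 1
        total = max(len(routed_expert_ids), 1)
        for i, count in enumerate(counts):
            frequency = count / total
            self.scores[i] = self.decay * self.scores[i] + (1 - self.decay) * frequency

    def hot_set(self, k):
        """Return the ids of the k experts with the highest scores."""
        order = sorted(range(len(self.scores)), key=lambda i: self.scores[i], reverse=True)
        return order[:k]


if __name__ == "__main__":
    tracker = HotnessTracker(num_experts=4, decay=0.5)
    for _ in range(3):
        tracker.update([0, 0, 1, 0])   # workload A favors expert 0
    print(tracker.hot_set(2))
    for _ in range(5):
        tracker.update([3, 3, 3, 3])   # workload B shifts to expert 3
    print(tracker.hot_set(1))
```

The decay factor controls how quickly old routing history is forgotten. A smaller value reacts faster to a workload shift but makes the hot set noisier.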
Abstract
Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key insight is to keep…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
