DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng

TL;DR
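DyMoE is a dynamic mixed-precision quantization framework for MoE inference on edge devices. It chooses each expert's precision at runtime according to the expert's importance, adapts the precision schedule to layer depth, and prefetches experts so that I/O overlaps with computation, cutting Time-to-First-Token by up to 22.7x and speeding up Time-Per-Output-Token by up to 14.58x.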
Contribution
It introduces a dynamic quantization approach tailored to edge MoE inference that addresses memory and I/O bottlenecks through importance-aware and depth-adaptive strategies.
Findings
Achieves up to 22.7x reduction in Time-to-First-Token
Realizes up to 14.58x speedup in Time-Per-Output-Token
Enables real-time MoE inference on resource-constrained edge hardware
Abstract
Despite the computational efficiency of Mixture-of-Experts (MoE) models, the large memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. Existing static methods are locked into a rigid latency-accuracy trade-off, whereas we observe that expert importance is highly skewed and depth-dependent. Motivated by these observations, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls with computation. Experimental results on commercial edge hardware show…
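The three mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 8-bit/4-bit choice, the linear mapping from a per-layer sensitivity score to a high-precision budget, and the single-worker prefetch loop are all assumptions made for illustration. The abstract does not say which layers are critical, so the sketch takes sensitivity as an input rather than deriving it from depth.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

W = TypeVar("W")
R = TypeVar("R")


def assign_bits(importance: List[float], sensitivity: float,
                high_bits: int = 8, low_bits: int = 4,
                min_frac: float = 0.25, max_frac: float = 0.75) -> List[int]:
    """Pick a bit width per expert in one layer (hypothetical policy).

    importance:  one score per expert, e.g. accumulated router weight.
    sensitivity: value in [0, 1]; higher means the layer tolerates less
                 quantization error (assumed to be supplied by the caller).
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    # Depth-adaptive part: sensitive layers get a larger high-precision budget.
    frac = min_frac + (max_frac - min_frac) * sensitivity
    n_high = math.floor(frac * len(importance) + 0.5)
    # Importance-aware part: the most important experts keep high precision.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    bits = [low_bits] * len(importance)
    for i in ranked[:n_high]:
        bits[i] = high_bits
    return bits


def run_with_prefetch(num_layers: int,
                      load: Callable[[int], W],
                      compute: Callable[[int, W], R]) -> List[R]:
    """Overlap loading of layer k+1 with computation of layer k."""
    outputs: List[R] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, 0)
        for layer in range(num_layers):
            weights = pending.result()      # wait for this layer's weights
            if layer + 1 < num_layers:      # start the next load right away
                pending = pool.submit(load, layer + 1)
            outputs.append(compute(layer, weights))
    return outputs


if __name__ == "__main__":
    scores = [0.05, 0.60, 0.10, 0.25]       # skewed importance, made up
    print(assign_bits(scores, sensitivity=0.0))   # [4, 8, 4, 4]
    print(assign_bits(scores, sensitivity=1.0))   # [4, 8, 8, 8]
    print(run_with_prefetch(3, lambda k: k * 10, lambda k, w: k + w))
```

In this sketch the skew in expert importance is what makes the policy cheap: a few experts take the high-precision budget while the rest stay at low precision, and the prefetch loop hides load time only to the extent that loading a layer is no slower than computing the previous one.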
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Age of Information Optimization
