TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling
Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu

TL;DR
TileQ is a fine-tuning-free, post-training low-rank quantization method for Mixture-of-Experts (MoE) models. It reduces the extra memory and inference latency that low-rank quantization otherwise adds, which makes deployment more efficient.
Contribution
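It introduces a 2D-tiling structured low-rank quantization technique that shares low-rank factors across both the input and output dimensions of MoE experts, and an inference method that fuses multiple low-rank expert computations into a single pass for better hardware utilization.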
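To make the idea of sharing factors along two dimensions concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' algorithm: it assumes the experts' quantization residuals (original weight minus dequantized weight) are laid out on a hypothetical p x q grid, and that tile (i, j) is approximated by a row factor U_i times a column factor V_j obtained from a truncated SVD of the whole grid. The function name, the grid layout, and the sizes are all invented for the example.

```python
import numpy as np

def shared_tile_factors(residuals, rank):
    """Fit factors shared across a 2D grid of tiles (illustrative assumption).

    residuals: p x q nested list of (m, n) arrays, one residual per expert tile.
    Returns row factors of shape (p, m, rank) and column factors (q, rank, n),
    so that tile (i, j) is approximated by row[i] @ col[j].
    """
    p, q = len(residuals), len(residuals[0])
    m, n = residuals[0][0].shape
    grid = np.block(residuals)                               # (p*m, q*n)
    u, s, vt = np.linalg.svd(grid, full_matrices=False)
    row = (u[:, :rank] * s[:rank]).reshape(p, m, rank)       # U_i, shared along a grid row
    col = vt[:rank].reshape(rank, q, n).transpose(1, 0, 2)   # V_j, shared along a grid column
    return row, col

# Hypothetical sizes: 16 experts on a 4 x 4 grid, each with a 32 x 32 residual.
p, q, m, n, rank = 4, 4, 32, 32, 4
rng = np.random.default_rng(0)
residuals = [[rng.standard_normal((m, n)) for _ in range(q)] for _ in range(p)]
row, col = shared_tile_factors(residuals, rank)

approx_00 = row[0] @ col[0]                  # low-rank correction for tile (0, 0)
shared_params = row.size + col.size          # p*m*rank + q*rank*n
per_expert_params = p * q * rank * (m + n)   # one private factor pair per expert
print(shared_params, per_expert_params)
```

With these toy sizes the shared layout stores 1024 factor entries where private per-expert factors of the same rank would store 4096. How TileQ actually chooses the tiling, the rank, and the fitting procedure is not stated in this summary.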
Findings
Reduces the additional memory usage of low-rank quantization by up to 10 times.
Cuts inference latency to approximately 5%.
Maintains state-of-the-art accuracy.
Abstract
Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive parameter count of their experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts down additional memory usage up to 10 and…
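The single-pass fusion mentioned in the abstract can be illustrated with a generic NumPy sketch. This is an assumption about what fusing low-rank expert computations can look like, not the paper's kernel: each expert e is given hypothetical factors A[e] (m x r) and B[e] (r x n), the A factors are concatenated so that one matrix multiply produces every expert's rank-r projection, and a batched contraction applies the B factors. Routing is ignored here, so every token is evaluated against every expert.

```python
import numpy as np

def lowrank_looped(x, A, B):
    """Reference path: two small matmuls per expert. Returns shape (E, T, n)."""
    return np.stack([(x @ A[e]) @ B[e] for e in range(A.shape[0])])

def lowrank_fused(x, A, B):
    """Fused path: one wide matmul for all experts, then a batched contraction."""
    E, m, r = A.shape
    A_cat = A.transpose(1, 0, 2).reshape(m, E * r)    # columns e*r .. e*r+r-1 hold A[e]
    h = (x @ A_cat).reshape(x.shape[0], E, r)         # all rank-r projections at once
    return np.einsum("ter,ern->etn", h, B)            # per-expert up-projection

# Hypothetical sizes: 8 experts, 5 tokens, 16 -> 12 features, rank 3.
E, T, m, n, r = 8, 5, 16, 12, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((T, m))
A = rng.standard_normal((E, m, r))
B = rng.standard_normal((E, r, n))

fused = lowrank_fused(x, A, B)
looped = lowrank_looped(x, A, B)
print(np.allclose(fused, looped))
```

Both paths compute the same values. The fused one replaces E narrow matrix multiplies with a single wide one, which is the kind of change that tends to improve utilization on GPUs. A real MoE layer would also restrict each token to its routed experts.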
