TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

Hongyaoxing Gu; Xinzhe Chen; Lijuan Hu; Fangfang Liu

arXiv:2605.09281 · cs.LG · May 12, 2026


TL;DR

TileQ is a fine-tuning-free post-training quantization method for MoE models that reduces the memory overhead and inference latency of low-rank quantization, making deployment more efficient.

Contribution

It introduces a 2D-tiling structured low-rank quantization technique and an inference method that fuses expert computations for improved efficiency.

Findings

01

Reduces the additional memory overhead of low-rank quantization by up to 10×.

02

Cuts inference latency to approximately 5%.

03

Maintains state-of-the-art accuracy.

Abstract

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive parameter counts of their experts pose significant challenges for deployment. While low-rank quantization offers a promising route to compressing MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both the input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts additional memory usage by up to 10× and…
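
The abstract does not spell out the tiling scheme, so the following is only an illustrative sketch of one way low-rank factors could be shared along both dimensions, not the authors' algorithm. It assumes (hypothetically) that the experts are arranged in a 2×2 grid, that each weight is quantized with plain 4-bit round-to-nearest, and that a single truncated SVD of the stacked quantization residuals yields a left factor `A` (one row block per grid row) and a right factor `B` (one column block per grid column). All names (`quantize_uniform`, `correction`, the grid layout, the rank) are made up for the example.

```python
import numpy as np

def quantize_uniform(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 8
grid_rows, grid_cols = 2, 2  # hypothetical 2x2 arrangement of four experts

# Toy expert weights and their quantized counterparts.
experts = [[rng.standard_normal((d_in, d_out)) for _ in range(grid_cols)]
           for _ in range(grid_rows)]
quantized = [[quantize_uniform(w) for w in row] for row in experts]

# Stack the quantization residuals into one (grid_rows*d_in, grid_cols*d_out) matrix.
residual = np.block([[experts[i][j] - quantized[i][j] for j in range(grid_cols)]
                     for i in range(grid_rows)])

# One truncated SVD gives factors shared across the whole grid.
u, s, vt = np.linalg.svd(residual, full_matrices=False)
A = u[:, :rank] * s[:rank]   # row block i is shared by every expert in grid row i
B = vt[:rank, :]             # column block j is shared by every expert in grid column j

def correction(i, j):
    """Low-rank correction for expert (i, j), built from the shared blocks."""
    return A[i * d_in:(i + 1) * d_in] @ B[:, j * d_out:(j + 1) * d_out]

# Total reconstruction error with and without the shared low-rank correction.
err_quant_only = np.linalg.norm(residual)
err_corrected = np.linalg.norm(residual - A @ B)

# Storage for the low-rank factors: shared tiling vs. one (A, B) pair per expert.
shared_params = A.size + B.size
per_expert_params = grid_rows * grid_cols * rank * (d_in + d_out)

# One matmul against the full B reproduces the corrections of every expert in grid row 0.
x = rng.standard_normal(d_in)
fused = (x @ A[:d_in]) @ B
separate = np.concatenate([x @ correction(0, j) for j in range(grid_cols)])

print(shared_params, per_expert_params)
print(err_corrected < err_quant_only, np.allclose(fused, separate))
```

In this toy setup the shared factors need half the storage of per-expert factors at the same rank, and the last check shows why shared factors lend themselves to fusing several low-rank products into one matrix multiplication. How TileQ actually chooses tiles, ranks, and its fused kernel is not recoverable from the truncated abstract.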
